SlideShare a Scribd company logo
1 of 60
Download to read offline
What’s in a Number?
         monitoring and measuring;
         the spiritual and the carnal.




Monday, September 19, 11
Theo Schlossnagle                                               @postwait

              Entrepreneur: OmniTI, Message Systems, Circonus
              Author: Scalable Internet Architectures, Web Operations
              Speaker: over 50 speaking engagements around the world
              Engineer: Senior ACM Member, IEEE Member, ACM Queue
              Open Source:
                      Apache Traffic Server, Varnish, Node.js, node-iptrie, node-amqp,
                      Solaris HTTP accept filers, OpenSSH + SecurID, JLog, Mungo,
                      Portable umem, Reconnoiter, mod_log_spread/spreadlogd, Spread,
                      Wackamole, Zetaback, concurrencykit, Net::RabbitMQ, MCrypt,
                      node-gmp, node-compress, and others.


Monday, September 19, 11
Monitoring ~ Voyeurism

              Watching from the outside in.
              Observing systems and their behaviors.
              Ultimately, a single observation distills into two things:
                    a set of circumstances
                    a value




Monday, September 19, 11
Circumstances: Dimensions
              What are you looking at?
                    outbound octets on
                    an Ethernet port (1/37) on a switch
                    in rack 12, in datacenter 6
                    for client ACME
              What time is it?



Monday, September 19, 11
Circumstances: Dimensions
              What are you looking at?
                    length of time until DOM ready for
                    a user loading a web page (/store/foo/item/bar)
                    using Chrome 10
                    from Verizon FioS in Fulton, Maryland.
              What time is it?



Monday, September 19, 11
Value
              Ultimately, there are two types of values we care about:
                    text
                           an version number, http code, ssh fingerprint
                    numeric
                           0.0000000000093123,
                           18446744064986053382,
                           1.1283e97, 6.33e-23,
                           0, -1, 1000, 42


Monday, September 19, 11
Text Data
              Incredibly useful,
              because the “when” can be correlated
              with other events and conditions.
              Idea:
                    if the distinct set of text values is sufficiently small,
                    consider the text value as a dimension, and
                    the new value = 1



Monday, September 19, 11
Text Metrics in Use




              Code deployments vs. process activity



Monday, September 19, 11
Numeric Data

              There tends to be a lot of it.
              Example web hits:
                    one box, one metric, 1 billion / month.
                    that’s only 400 events/second.
                    each event has about 20 dimensions.




Monday, September 19, 11
Numeric Metrics in Use




              672px ≅ 4 weeks (2419200s)   or   3600s ≅ 1px



Monday, September 19, 11
Numeric Metrics in Use




              672px ≅ 4 weeks (2419200s) or 3600s ≅ 1px
              in previous example: 24000 samples per ‘point’


Monday, September 19, 11
Numeric Metrics in Use




              672px ≅ 4 weeks (2419200s) or 3600s ≅ 1px
              in previous example: 24000 samples per ‘point’
              even at one minute samples: 60 samples per ‘point’
Monday, September 19, 11
An Event
         each is a singularity:
         individual;
         unrepeatable;
         unique.
                                  mondolithic.com




Monday, September 19, 11
Passive vs. Active
              Active is a synthesized event.
                    You perform some action an measure it.
              Passive is the collection of other events.
                    Stuff that really happened.


              Ask yourself which is better.



Monday, September 19, 11
Conceding the Truth
              Passive events are ideal in high traffic environments.
              Active events are necessary in low traffic environments.


              “high and low” are relative, so...
                    you need both
                    always



Monday, September 19, 11
Numbers: how they work.
              Think of a car’s speedometer.
                    it shows you how fast you are going (sorta)
                    at the moment you read it (almost)
                    the only reason this is acceptable is:
                           you have a rudimentary understanding of physics,
                           you are in the car, and
                           your body is exceptionally good
                           at detecting acceleration.
Monday, September 19, 11
Computers are different.

              computers are fast: billions of instructions per second.
              you aren’t inside the computer.
              most “users” somehow forget rudimentary physics.
              interfaces and visual representations mislead you.
              a speedometer (or familiar readout) leads you to
              a false confidence in system stability between samples.



Monday, September 19, 11
Write IOPS
         Simple, relatively stable, easy to trend...
         and wrong.
Monday, September 19, 11
Zooming in shows insight.
         It is not as smooth as we thought.
         Good to know.
Monday, September 19, 11
Dishonesty Revealed.
         This is “inside” a 5 minute sample.
         Note: we’re still sampling; now at 0.5 Hz.
Monday, September 19, 11
You’ve Been Deceived.

         1. Denial and Isolation
         2. Anger
         3. Bargaining
         4. Depression
         5. Acceptance




Monday, September 19, 11
You’ve Been Deceived.
                                   let’s just skip this one, okay?

         1. Denial and Isolation
         2. Anger
         3. Bargaining
         4. Depression
         5. Acceptance




Monday, September 19, 11
You’ve Been Deceived.
                                   let’s just skip this one, okay?

         1. Denial and Isolation
                                   If you’re not angry, get angry:
         2. Anger                    it’s the next step in healing

         3. Bargaining
         4. Depression
         5. Acceptance




Monday, September 19, 11
You’ve Been Deceived.
                                        let’s just skip this one, okay?

         1. Denial and Isolation
                                       If you’re not angry, get angry:
         2. Anger                        it’s the next step in healing

         3. Bargaining
                                   This is where it gets interesting.
         4. Depression
         5. Acceptance




Monday, September 19, 11
Bargaining.

              I can’t reasonably store all the data:
                    likely true.
              Since I can’t store all the data,
              I have to store a summarization
                    yes; they’re called “aggregates.”




Monday, September 19, 11
Bargaining: you’re losing.
              You are sampling things and losing data.
                    the speedometer example.


              or


              You are taking an average of a large set of samples.



Monday, September 19, 11
“Average” is a bad bargain.

              The average of a set of numbers is:
                    the single most useful statistic
              By itself, it provides very little insight into
              the nature of the original set.
              Hello statistics!




Monday, September 19, 11
Rounding out your statistics.
                You should store, in order of importance:
                    average                     derivatives
                    cardinality
                    standard deviation          derivative stddev
                    covariance                  derivative covariance
                    min, max                    derivative min, max




Monday, September 19, 11
Rounding out your statistics.
                You should store, in order of importance:
                    average                        derivatives
                    cardinality
                    standard deviation             derivative stddev
                    covariance                     derivative covariance
                    min, max                       derivative min, max

                    FWIW: we don’t store these at Circonus (today)

Monday, September 19, 11
What statistics give you.

              With the cardinality, average and stddev
                    you can understand if a “reasonable” number of the
                    data points are “near” the average.
              However, this is often misleading itself.
              We expect normal distributions or
              (more likely) gamma distributions



Monday, September 19, 11
So, we can make due.


              No.
              Let’s revisit the five stages of grieving.




Monday, September 19, 11
You’ve Been Deceived.
                                        let’s just skip this one, okay?

         1. Denial and Isolation
                                       If you’re not angry, get angry:
         2. Anger                        it’s the next step in healing

         3. Bargaining
                                   This is where it gets interesting.
         4. Depression
         5. Acceptance




Monday, September 19, 11
You’ve Been Deceived.
                                        let’s just skip this one, okay?

         1. Denial and Isolation
                                       If you’re not angry, get angry:
         2. Anger                        it’s the next step in healing

         3. Bargaining
                                   This is where it gets interesting.
         4. Depression
         5. Acceptance               Here we are.




Monday, September 19, 11
Gamma Gamma Gamma
                                                   average: 0.72




              0            1   2   3   4   5   6   7   8   9       10




Monday, September 19, 11
Gamma Gamma Gamma




              0            1   2   3   4   5   6   7   8   9   10




Monday, September 19, 11
Gamma Gamma Gamma
                                                   average: 1.16




              0            1   2   3   4   5   6   7   8   9       10




Monday, September 19, 11
Gamma Gamma Gamma




              0            1   2   3   4   5   6   7   8   9   10




Monday, September 19, 11
Gamma Gamma Gamma




Monday, September 19, 11
Gamma Gamma Gamma
                             average: 4.67




Monday, September 19, 11
Gamma Gamma Gamma
                                                   average: 4.67




              0            1   2   3   4   5   6   7   8   9




Monday, September 19, 11
Do we need more?
              Wait. Stop.
              In some systems we know enough.
              We know their distributions,
              ... the leather seat, the inertia of the vessel.
              A sample here and there, an average, etc.
              Good enough: temperature, load, payments, bugs.
              Not enough: disk IOPS, memory use, CPU time.


Monday, September 19, 11
We do need more.


              we have a lot of data sources from which
              we typically compose an average of many values, and
              understanding the distribution is extremely valuable.




Monday, September 19, 11
Histograms: the need.
              Given an average of 4, and a cardinality of 1000
              maybe you had 1000 samples of exactly 4?
              Dunno.
              If you have a stddev of 1,
              about 68.2% of your
              data points will be
              between 3 and 5.


              This will never be as detailed as a
              histogram of the full dataset.

Monday, September 19, 11
Histograms: the want

              Ideally, we’d want histograms
              bucketed as we see fit
              with all data represented.
              Too much data: this isn’t tractable.
              So, how do we bucket?




Monday, September 19, 11
Bucketing Basics
              If I had numbers bounded by a small space:
                    between 0ms and 10ms
              I could bucket [0-1),[1-2),...[9,10).
                    eleven buckets, fairly manageable.
              What do I do when I get 20? or 200? or 20000?
              What do I do when I get lots of data
              between 0.001 and 1?


Monday, September 19, 11
How we do more.

              We make an assumption (yes: ass: u & mption)
              We assume the same thing floating point assumes:
                    The difference between “very large” numbers and the
                    difference between “very small” numbers can have
                    accuracy inversely proportional to their magnitude.
              We take this and build histograms.



Monday, September 19, 11
Solution: log buckets
              [0.001,0.01),                 [2-10,2-9),[2-9,2-8),[2-8,2-7),
              [0.01,0.1),                   [2-7,2-6),[2-6,2-5),[2-5,2-4),
              [0.1,1),                      [2-4,2-3),[2-3,2-2),[2-2,2-1),
              [1,10),                       [2-1,1),[1,2), [2,4), [4,8),
              [10,100),                     [8,16),[16,32),[32,64),
              [100,1000),                   [64,128),[27,28),[29,210),
              [1000,10000),                 [210,211),[211,212),[212,213),
              [10000,100000)                [213,214),[214,215)

                           Engineers might like these,
                              but humans do not.

Monday, September 19, 11
Histograms: the compromise

              Log-linear:
                    0.1 , 0.15 , 0.2 , 0.25 , 0.3 , 0.35 , 0.4 , 0.45
                    0.5 ,        0.6 ,        0.7 ,        0.8 ,      0.9
                    1.0 , 1.5 , 2.0 , 2.5 , 3.0 , 3.5 , 4.0 , 4.5
                    5.0 ,        6.0 ,        7.0 ,        8.0 ,      9.0
                    10...
              I first witnessed this in DTrace courtesy of Bryan Cantrill



Monday, September 19, 11
Where are we now?
              For some data, we store simple statistical aggregations
              Other bits of data we use log-linear histograms
                    quite insightful, while
                    still small enough to store
              A single “five minute average” can look more like this:




Monday, September 19, 11
You’ve Been Deceived.
                                        let’s just skip this one, okay?

         1. Denial and Isolation
                                       If you’re not angry, get angry:
         2. Anger                        it’s the next step in healing

         3. Bargaining
                                   This is where it gets interesting.
         4. Depression
         5. Acceptance               We are still here...
                                     at least in the open source world.



Monday, September 19, 11
Reconnoiter

              Reconnoiter:
                    https://labs.omniti.com/labs/reconnoiter
                    https://github.com/omniti-labs/reconnoiter


              Open source (BSD)




Monday, September 19, 11
Reconnoiter: backend
              C (as God intended),
              with lua bindings to ease plugin development
              Java (external) bridge for making things like JMX easier
              Esper (Java) for streaming queries (fault detection)
              AMQP and REST for component intercommunication
              PostgreSQL backing store
              Very actively developed


Monday, September 19, 11
Reconnoiter: frontend

              HTML + CSS
              Javascript: JQuery, flot
              PHP for AJAX APIs


              Needs engineers.




Monday, September 19, 11
Features (1)
              Data Collection:
                    most standard protocols (SNMP, REST, JDBC, etc.)
                    some nice bridges to legacy systems:
                    (nagios, munin, etc.)
              Data Storage:
                    supports sharding
                    never delete old data


Monday, September 19, 11
Features (2)
                    We don’t do histograms now
                    Because all the old data is stored, we could
                           If we have enough data to make this worthwhile,
                           we will likley need an on-the-fly approach
                    We store raw data and rollups including:
                           average, cardinality, stddev, derivatives
                           (and counters)



Monday, September 19, 11
Features (3)
                    separates data from visualization
                    drag-n-drop graphing
                    real-time data:
                           you can run checks “often” (every 100ms?)
                           you can see that data in the browser
                           you can compose a graph from disparate data



Monday, September 19, 11
Active vs. Passive
              Reconnoiter supports passive metric submission
                    what version of code is deployed
              Reconnoiter’s design is focused on active collection
                    how may users have signed up
                    how many visitors to your site
                    how full is your disk
                    how many open bugs
                    how many hours spent

Monday, September 19, 11
Bias
              Reconnoiter doesn’t solve all your problems
              (we’re busy trying to do that with Circonus)


              I believe Reconnoiter is the best tool available
              for heterogeneous, multi-datacenter, full-architecture
              monitoring.


              I am seriously biased.


Monday, September 19, 11
You’ve Been Deceived.
                                        let’s just skip this one, okay?

         1. Denial and Isolation
                                       If you’re not angry, get angry:
         2. Anger                        it’s the next step in healing

         3. Bargaining
                                   This is where it gets interesting.
         4. Depression
         5. Acceptance               Here we are.


                            This is on you. Good luck.

Monday, September 19, 11
Thank You!
         Questions?

Monday, September 19, 11

More Related Content

Viewers also liked

Viewers also liked (14)

Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
 
The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
 
Omnios and unix
Omnios and unixOmnios and unix
Omnios and unix
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
 
Craftsmanship
CraftsmanshipCraftsmanship
Craftsmanship
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012
 
Project reality
Project realityProject reality
Project reality
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentation
 
It's all about telemetry
It's all about telemetryIt's all about telemetry
It's all about telemetry
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
 
A Coherent Discussion About Performance
A Coherent Discussion About PerformanceA Coherent Discussion About Performance
A Coherent Discussion About Performance
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
 
Operational Software Design
Operational Software DesignOperational Software Design
Operational Software Design
 

More from Theo Schlossnagle

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to ComplexityTheo Schlossnagle
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwareTheo Schlossnagle
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or NotTheo Schlossnagle
 
Applying SRE techniques to micro service design
Applying SRE techniques to micro service designApplying SRE techniques to micro service design
Applying SRE techniques to micro service designTheo Schlossnagle
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoringTheo Schlossnagle
 
Building Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachTheo Schlossnagle
 
Applying operations culture to everything
Applying operations culture to everythingApplying operations culture to everything
Applying operations culture to everythingTheo Schlossnagle
 

More from Theo Schlossnagle (13)

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
 
Monitoring 101
Monitoring 101Monitoring 101
Monitoring 101
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or Not
 
Applying SRE techniques to micro service design
Applying SRE techniques to micro service designApplying SRE techniques to micro service design
Applying SRE techniques to micro service design
 
Commandments of scale
Commandments of scaleCommandments of scale
Commandments of scale
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoring
 
Building Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approach
 
Webops dashboards
Webops dashboardsWebops dashboards
Webops dashboards
 
Web Operations Career
Web Operations CareerWeb Operations Career
Web Operations Career
 
Http front-ends
Http front-endsHttp front-ends
Http front-ends
 
Applying operations culture to everything
Applying operations culture to everythingApplying operations culture to everything
Applying operations culture to everything
 

Recently uploaded

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Recently uploaded (20)

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

What's in a number?

  • 1. What’s in a Number? monitoring and measuring; the spiritual and the carnal. Monday, September 19, 11
  • 2. Theo Schlossnagle @postwait Entrepreneur: OmniTI, Message Systems, Circonus Author: Scalable Internet Architectures, Web Operations Speaker: over 50 speaking engagements around the world Engineer: Senior ACM Member, IEEE Member, ACM Queue Open Source: Apache Traffic Server, Varnish, Node.js, node-iptrie, node-amqp, Solaris HTTP accept filers, OpenSSH + SecurID, JLog, Mungo, Portable umem, Reconnoiter, mod_log_spread/spreadlogd, Spread, Wackamole, Zetaback, concurrencykit, Net::RabbitMQ, MCrypt, node-gmp, node-compress, and others. Monday, September 19, 11
  • 3. Monitoring ~ Voyeurism Watching from the outside in. Observing systems and their behaviors. Ultimately, a single observation distills into two things: a set of circumstances a value Monday, September 19, 11
  • 4. Circumstances: Dimensions What are you looking at? outbound octets on an Ethernet port (1/37) on a switch in rack 12, in datacenter 6 for client ACME What time is it? Monday, September 19, 11
  • 5. Circumstances: Dimensions What are you looking at? length of time until DOM ready for a user loading a web page (/store/foo/item/bar) using Chrome 10 from Verizon FioS in Fulton, Maryland. What time is it? Monday, September 19, 11
  • 6. Value Ultimately, there are two types of values we care about: text an version number, http code, ssh fingerprint numeric 0.0000000000093123, 18446744064986053382, 1.1283e97, 6.33e-23, 0, -1, 1000, 42 Monday, September 19, 11
  • 7. Text Data Incredibly useful, because the “when” can be correlated with other events and conditions. Idea: if the distinct set of text values is sufficiently small, consider the text value as a dimension, and the new value = 1 Monday, September 19, 11
  • 8. Text Metrics in Use Code deployments vs. process activity Monday, September 19, 11
  • 9. Numeric Data There tends to be a lot of it. Example web hits: one box, one metric, 1 billion / month. that’s only 400 events/second. each event has about 20 dimensions. Monday, September 19, 11
  • 10. Numeric Metrics in Use 672px ≅ 4 weeks (2419200s) or 3600s ≅ 1px Monday, September 19, 11
  • 11. Numeric Metrics in Use 672px ≅ 4 weeks (2419200s) or 3600s ≅ 1px in previous example: 24000 samples per ‘point’ Monday, September 19, 11
  • 12. Numeric Metrics in Use 672px ≅ 4 weeks (2419200s) or 3600s ≅ 1px in previous example: 24000 samples per ‘point’ even at one minute samples: 60 samples per ‘point’ Monday, September 19, 11
  • 13. An Event each is a singularity: individual; unrepeatable; unique. mondolithic.com Monday, September 19, 11
  • 14. Passive vs. Active Active is a synthesized event. You perform some action an measure it. Passive is the collection of other events. Stuff that really happened. Ask yourself which is better. Monday, September 19, 11
  • 15. Conceding the Truth Passive events are ideal in high traffic environments. Active events are necessary in low traffic environments. “high and low” are relative, so... you need both always Monday, September 19, 11
  • 16. Numbers: how they work. Think of a car’s speedometer. it shows you how fast you are going (sorta) at the moment you read it (almost) the only reason this is acceptable is: you have a rudimentary understanding of physics, you are in the car, and your body is exceptionally good at detecting acceleration. Monday, September 19, 11
  • 17. Computers are different. computers are fast: billions of instructions per second. you aren’t inside the computer. most “users” somehow forget rudimentary physics. interfaces and visual representations mislead you. a speedometer (or familiar readout) leads you to a false confidence in system stability between samples. Monday, September 19, 11
  • 18. Write IOPS Simple, relatively stable, easy to trend... and wrong. Monday, September 19, 11
  • 19. Zooming in shows insight. It is not as smooth as we thought. Good to know. Monday, September 19, 11
  • 20. Dishonesty Revealed. This is “inside” a 5 minute sample. Note: we’re still sampling; now at 0.5 Hz. Monday, September 19, 11
  • 21. You’ve Been Deceived. 1. Denial and Isolation 2. Anger 3. Bargaining 4. Depression 5. Acceptance Monday, September 19, 11
  • 22. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation 2. Anger 3. Bargaining 4. Depression 5. Acceptance Monday, September 19, 11
  • 23. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining 4. Depression 5. Acceptance Monday, September 19, 11
  • 24. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining This is where it gets interesting. 4. Depression 5. Acceptance Monday, September 19, 11
  • 25. Bargaining. I can’t reasonably store all the data: likely true. Since I can’t store all the data, I have to store a summarization yes; they’re called “aggregates.” Monday, September 19, 11
  • 26. Bargaining: you’re losing. You are sampling things and losing data. the speedometer example. or You are taking an average of a large set of samples. Monday, September 19, 11
  • 27. “Average” is a bad bargain. The average of a set of numbers is: the single most useful statistic By itself, it provides very little insight into the nature of the original set. Hello statistics! Monday, September 19, 11
  • 28. Rounding out your statistics. You should store, in order of importance: average derivatives cardinality standard deviation derivative stddev covariance derivative covariance min, max derivative min, max Monday, September 19, 11
  • 29. Rounding out your statistics. You should store, in order of importance: average derivatives cardinality standard deviation derivative stddev covariance derivative covariance min, max derivative min, max FWIW: we don’t store these at Circonus (today) Monday, September 19, 11
  • 30. What statistics give you. With the cardinality, average and stddev you can understand if a “reasonable” number of the data points are “near” the average. However, this is often misleading itself. We expect normal distributions or (more likely) gamma distributions Monday, September 19, 11
  • 31. So, we can make due. No. Let’s revisit the five stages of grieving. Monday, September 19, 11
  • 32. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining This is where it gets interesting. 4. Depression 5. Acceptance Monday, September 19, 11
  • 33. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining This is where it gets interesting. 4. Depression 5. Acceptance Here we are. Monday, September 19, 11
  • 34. Gamma Gamma Gamma average: 0.72 0 1 2 3 4 5 6 7 8 9 10 Monday, September 19, 11
  • 35. Gamma Gamma Gamma 0 1 2 3 4 5 6 7 8 9 10 Monday, September 19, 11
  • 36. Gamma Gamma Gamma average: 1.16 0 1 2 3 4 5 6 7 8 9 10 Monday, September 19, 11
  • 37. Gamma Gamma Gamma 0 1 2 3 4 5 6 7 8 9 10 Monday, September 19, 11
  • 38. Gamma Gamma Gamma Monday, September 19, 11
  • 39. Gamma Gamma Gamma average: 4.67 Monday, September 19, 11
  • 40. Gamma Gamma Gamma average: 4.67 0 1 2 3 4 5 6 7 8 9 Monday, September 19, 11
  • 41. Do we need more? Wait. Stop. In some systems we know enough. We know their distributions, ... the leather seat, the inertia of the vessel. A sample here and there, an average, etc. Good enough: temperature, load, payments, bugs. Not enough: disk IOPS, memory use, CPU time. Monday, September 19, 11
  • 42. We do need more. we have a lot of data sources from which we typically compose an average of many values, and understanding the distribution is extremely valuable. Monday, September 19, 11
  • 43. Histograms: the need. Given an average of 4, and a cardinality of 1000 maybe you had 1000 samples of exactly 4? Dunno. If you have a stddev of 1, about 68.2% of your data points will be between 3 and 5. This will never be as detailed as a histogram of the full dataset. Monday, September 19, 11
  • 44. Histograms: the want Ideally, we’d want histograms bucketed as we see fit with all data represented. Too much data: this isn’t tractable. So, how do we bucket? Monday, September 19, 11
  • 45. Bucketing Basics If I had numbers bounded by a small space: between 0ms and 10ms I could bucket [0-1),[1-2),...[9,10). eleven buckets, fairly manageable. What do I do when I get 20? or 200? or 20000? What do I do when I get lots of data between 0.001 and 1? Monday, September 19, 11
  • 46. How we do more. We make an assumption (yes: ass: u & mption) We assume the same thing floating point assumes: The difference between “very large” numbers and the difference between “very small” numbers can have accuracy inversely proportional to their magnitude. We take this and build histograms. Monday, September 19, 11
  • 47. Solution: log buckets [0.001,0.01), [2-10,2-9),[2-9,2-8),[2-8,2-7), [0.01,0.1), [2-7,2-6),[2-6,2-5),[2-5,2-4), [0.1,1), [2-4,2-3),[2-3,2-2),[2-2,2-1), [1,10), [2-1,1),[1,2), [2,4), [4,8), [10,100), [8,16),[16,32),[32,64), [100,1000), [64,128),[27,28),[29,210), [1000,10000), [210,211),[211,212),[212,213), [10000,100000) [213,214),[214,215) Engineers might like these, but humans do not. Monday, September 19, 11
  • 48. Histograms: the compromise Log-linear: 0.1 , 0.15 , 0.2 , 0.25 , 0.3 , 0.35 , 0.4 , 0.45 0.5 , 0.6 , 0.7 , 0.8 , 0.9 1.0 , 1.5 , 2.0 , 2.5 , 3.0 , 3.5 , 4.0 , 4.5 5.0 , 6.0 , 7.0 , 8.0 , 9.0 10... I first witnessed this in DTrace courtesy of Bryan Cantrill Monday, September 19, 11
  • 49. Where are we now? For some data, we store simple statistical aggregations Other bits of data we use log-linear histograms quite insightful, while still small enough to store A single “five minute average” can look more like this: Monday, September 19, 11
  • 50. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining This is where it gets interesting. 4. Depression 5. Acceptance We are still here... at least in the open source world. Monday, September 19, 11
  • 51. Reconnoiter Reconnoiter: https://labs.omniti.com/labs/reconnoiter https://github.com/omniti-labs/reconnoiter Open source (BSD) Monday, September 19, 11
  • 52. Reconnoiter: backend C (as God intended), with lua bindings to ease plugin development Java (external) bridge for making things like JMX easier Esper (Java) for streaming queries (fault detection) AMQP and REST for component intercommunication PostgreSQL backing store Very actively developed Monday, September 19, 11
  • 53. Reconnoiter: frontend HTML + CSS Javascript: JQuery, flot PHP for AJAX APIs Needs engineers. Monday, September 19, 11
  • 54. Features (1) Data Collection: most standard protocols (SNMP, REST, JDBC, etc.) some nice bridges to legacy systems: (nagios, munin, etc.) Data Storage: supports sharding never delete old data Monday, September 19, 11
  • 55. Features (2) We don’t do histograms now Because all the old data is stored, we could If we have enough data to make this worthwhile, we will likley need an on-the-fly approach We store raw data and rollups including: average, cardinality, stddev, derivatives (and counters) Monday, September 19, 11
  • 56. Features (3) separates data from visualization drag-n-drop graphing real-time data: you can run checks “often” (every 100ms?) you can see that data in the browser you can compose a graph from disparate data Monday, September 19, 11
  • 57. Active vs. Passive Reconnoiter supports passive metric submission what version of code is deployed Reconnoiter’s design is focused on active collection how may users have signed up how many visitors to your site how full is your disk how many open bugs how many hours spent Monday, September 19, 11
  • 58. Bias Reconnoiter doesn’t solve all your problems (we’re busy trying to do that with Circonus) I believe Reconnoiter is the best tool available for heterogeneous, multi-datacenter, full-architecture monitoring. I am seriously biased. Monday, September 19, 11
  • 59. You’ve Been Deceived. let’s just skip this one, okay? 1. Denial and Isolation If you’re not angry, get angry: 2. Anger it’s the next step in healing 3. Bargaining This is where it gets interesting. 4. Depression 5. Acceptance Here we are. This is on you. Good luck. Monday, September 19, 11
  • 60. Thank You! Questions? Monday, September 19, 11