© 2016 MapR Technologies
Anomaly Detection:
How To Find What You Didn’t Know to Look For
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Ellen Friedman, Consultant and Commentator
Email ellenf@apache.org
Twitter @Ellen_Friedman
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real-world business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
Anomaly Detection
Who Needs Anomaly Detection?
Utility providers using
smart meters
Who Needs Anomaly Detection?
Feedback from
manufacturing assembly
lines
Who Needs Anomaly Detection?
Monitoring data traffic on
communication networks
What is Anomaly Detection?
• The goal is to discover rare events
– especially those that shouldn’t have happened
• Find a problem before other people see it
– especially before it causes a problem for customers
• Why is this a challenge?
– I don’t know what an anomaly looks like (yet)
Spot the Anomaly
Spot the Anomaly
Looks pretty
anomalous
to me
Spot the Anomaly
Will the real anomaly
please stand up?
Basic idea:
Find “normal” first
Steps in Anomaly Detection
• Build a model: Collect and process data for training a model
• Use the machine learning model to determine what is the normal
pattern
• Decide how far from this normal pattern something must be before you consider it anomalous
• Use the AD model to detect anomalies in new data
– Methods such as clustering for discovery can be helpful
How hard is it to set an alert for anomalies?
Grey data is from normal events; x’s are anomalies.
Where would you set the threshold?
Basic idea:
Set adaptive thresholds
What Are We Really Doing?
• We want action when something breaks
(dies/falls over/otherwise gets in trouble)
• But action is expensive
• So we don’t want too many false alarms
• And we don’t want too many false negatives
• What’s the right threshold to set for alerts?
– We need to trade off costs
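The cost trade-off can be made concrete with a small sketch. Everything here is an illustrative assumption rather than something from the talk: the function names, the labelled history of scores, and the two cost settings. It simply tries each observed score as a threshold and keeps the cheapest one.

```python
def alert_cost(threshold, scores, is_anomaly, cost_false_alarm, cost_miss):
    """Total cost of alerting on every score above `threshold`."""
    cost = 0.0
    for score, anomalous in zip(scores, is_anomaly):
        alerted = score > threshold
        if alerted and not anomalous:
            cost += cost_false_alarm      # action taken for nothing
        elif anomalous and not alerted:
            cost += cost_miss             # breakage went unnoticed
    return cost


def best_threshold(scores, is_anomaly, cost_false_alarm, cost_miss):
    """Try each observed score as a threshold and keep the cheapest."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: alert_cost(t, scores, is_anomaly, cost_false_alarm, cost_miss),
    )


# Hypothetical labelled history: normal values overlap slightly with anomalies.
scores = [1.0, 1.2, 0.9, 1.1, 3.5, 1.3, 3.0, 4.0, 2.5]
is_anomaly = [False, False, False, False, True, False, False, True, True]

# Misses are expensive: the threshold drops and accepts a false alarm.
cheap_alarms = best_threshold(scores, is_anomaly, cost_false_alarm=1, cost_miss=50)
# False alarms are expensive: the threshold rises and accepts a miss.
costly_alarms = best_threshold(scores, is_anomaly, cost_false_alarm=50, cost_miss=1)
```

With these made-up numbers the chosen threshold moves up as false alarms become more expensive, which is the trade-off the slide describes.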
A Second Look
99.9%-ile
Cool algorithm: t-digest
How Hard Can It Be?
[Diagram: each value x feeds an online summarizer that tracks the 99.9%-ile as threshold t; if x > t, raise an alarm]
Using t-Digest
• The t-digest is an on-line percentile estimator
– very high accuracy for extreme tails
• t-digest is also widely available
– in Elasticsearch, in Solr
– in streamlib (open source library on GitHub)
– in Mahout Math (open source library on GitHub)
– standalone (GitHub and Maven Central)
• Very handy for general distributions, few assumptions
• For latency, exponential binning may be useful
– See, for instance, HdrHistogram
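The adaptive-threshold loop can be sketched in a few lines. To stay self-contained, this sketch does not call a real t-digest library: `ExactQuantileSummarizer` is a hypothetical stand-in that keeps every value in a sorted list, and `detect`, `warmup`, and the injected anomaly are likewise assumptions made for illustration.

```python
import bisect
import random


class ExactQuantileSummarizer:
    """Stand-in for a t-digest: keeps every value in sorted order.

    A real t-digest keeps a small, bounded summary instead; this exact
    version only illustrates where the summarizer sits in the loop.
    """

    def __init__(self):
        self._sorted = []

    def add(self, x):
        bisect.insort(self._sorted, x)

    def quantile(self, q):
        if not self._sorted:
            raise ValueError("no data yet")
        index = min(int(q * len(self._sorted)), len(self._sorted) - 1)
        return self._sorted[index]


def detect(stream, q=0.999, warmup=1000):
    """Yield (position, value) for values above the running q-quantile."""
    summary = ExactQuantileSummarizer()
    for i, x in enumerate(stream):
        if i >= warmup and x > summary.quantile(q):
            yield i, x
        summary.add(x)   # update after testing, so x is judged against the past


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
data[4000] = 12.0                       # injected anomaly (assumption)
alarms = list(detect(data))
```

Swapping the stand-in for an actual t-digest should leave the loop unchanged, since the detector only needs "add a value" and "read a quantile".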
So are we all done?
What About This?
[Figure: signal built as offset + noise + pulse1 + pulse2, with points A and B marked]
Model Delta Anomaly Detection
[Diagram: the model's output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real
anomaly
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple: $x \sim m(t) + \mathcal{N}(0, \epsilon)$
• The real world is rarely so accommodating
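A minimal sketch of the "model plus noise" idea follows. The sinusoidal `model`, the noise level, and the injected anomaly are all assumptions chosen for illustration. The point is that an anomaly can sit well inside the raw signal range and still stand out once the model's expectation is subtracted.

```python
import math
import random


def model(t):
    """Assumed model of normal behaviour m(t): a 24-step sine wave."""
    return 10.0 + 3.0 * math.sin(2.0 * math.pi * t / 24.0)


def deltas(times, values):
    """Residual between what was observed and what the model expected."""
    return [x - model(t) for t, x in zip(times, values)]


rng = random.Random(7)
times = list(range(240))
values = [model(t) + rng.gauss(0.0, 0.2) for t in times]
values[114] += 2.0        # injected anomaly at a trough, so the raw value looks ordinary

residuals = deltas(times, values)
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
```

A fixed threshold on the raw values would not separate the injected point from ordinary peaks, while its residual is the largest in the series.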
We Do Windows
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding
shapes from the dictionary
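The windowing and clustering steps can be sketched as follows. This is a simplified illustration, not the talk's implementation: it encodes each window with a single nearest prototype (no shifting, scaling, or adding of several shapes), uses a synthetic sine wave in place of EKG data, and seeds k-means from the first k windows purely to keep the run deterministic. All function names are hypothetical.

```python
import math


def windows(signal, width, step):
    """Cut the signal into fixed-width windows."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, w):
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], w))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; returns k centroids (the dictionary of shapes)."""
    centroids = [list(p) for p in points[:k]]   # deterministic seeding (assumption)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def reconstruction_error(centroids, w):
    """Root-mean-square error of encoding a window by its nearest shape."""
    return math.sqrt(sq_dist(centroids[nearest(centroids, w)], w) / len(w))


# Synthetic periodic training signal (an assumption standing in for EKG data).
train = [math.sin(2 * math.pi * t / 32) for t in range(2048)]
dictionary = kmeans(windows(train, width=32, step=4), k=8)

normal = [math.sin(2 * math.pi * t / 32) for t in range(32)]
glitch = list(normal)
glitch[10:16] = [0.0] * 6          # flat spot the model never saw

normal_error = reconstruction_error(dictionary, normal)
glitch_error = reconstruction_error(dictionary, glitch)
```

A window the dictionary has seen reconstructs almost exactly, while the window with the flat spot leaves a large reconstruction error. That error is the signal a one-dimensional threshold detector can then watch.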
Most Common Shapes (for EKG)
Reconstructed signal
[Figure: original signal, reconstructed signal, and reconstruction error (< 1 bit / sample)]
An Anomaly
The original technique for finding 1-d anomalies works on the reconstruction error
Close-up of anomaly
Not what you want your
heart to do.
And not what the model
expects it to do.
A Different Kind of Anomaly
Model Delta Anomaly Detection
[Diagram: the model's output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]
The Real Inside Scoop
• The model-delta anomaly detector is really just a sum of random
variables
– the model we know about already
– and a normally distributed error
• The output (delta) is (roughly) the log probability of the sum distribution (really δ²)
• Thinking about probability distributions is good
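For the normally distributed error case, the link between δ² and log probability can be written out explicitly. Here σ is a symbol assumed for this sketch, denoting the standard deviation of the error:

```latex
\begin{aligned}
x &= m(t) + \delta, \qquad \delta \sim \mathcal{N}(0, \sigma^2) \\
p(\delta) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\delta^2}{2\sigma^2}\right) \\
-\log p(\delta) &= \frac{\delta^2}{2\sigma^2} + \log\!\left(\sigma\sqrt{2\pi}\right)
\end{aligned}
```

Up to a constant offset and scale, thresholding δ² is therefore the same as thresholding the negative log probability under this error model.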
Some k-means Caveats
• But Eamonn Keogh says that k-means can’t work on time-series
• That is silly … and kind of correct, k-means does have limits
– Other kinds of auto-encoders are much more powerful
• More fun and code demos at
– https://github.com/tdunning/k-means-auto-encoder
http://www.cs.ucr.edu/~eamonn/meaningless.pdf
The Limits of Clustering as Auto-encoder
• Clustering is like trying to tile your sample distribution
• Can be used to approximate a signal
• Filling a d-dimensional region with k clusters should give an error of about $\epsilon \approx 1/\sqrt[d]{k}$
• If d is large, this is no good
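To see why large d is a problem, here is a rough worked example. It assumes the relation reads error ≈ k^(-1/d), which inverts to k ≈ (1/error)^d. The function names and the target error of 0.1 are made up for illustration.

```python
def clusters_needed(target_error, dimensions):
    """Invert error ~ k ** (-1 / d) to get k ~ (1 / error) ** d."""
    return (1.0 / target_error) ** dimensions


def expected_error(k, dimensions):
    """Rough tiling error for k clusters in `dimensions` dimensions."""
    return k ** (-1.0 / dimensions)


# Number of clusters needed for an error of about 0.1 as d grows.
for d in (1, 3, 10):
    print(d, round(clusters_needed(0.1, d)))
```

Under this assumed scaling, the number of clusters needed for a fixed error grows exponentially with the dimension, so a dictionary that is manageable for d = 3 becomes impractical for d = 10.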
[Figure: time series training data (first 2000 samples) against time; panels: test data, reconstruction, error]
[Figure: reconstruction error for time-series data; x-axis: centroids; y-axis: MAV error; series: training data, held-out data]
Another Example
• Take points randomly in , project non-linearly into
• Approximation using clustering should give
[Figure: reconstruction error for random points; x-axis: centroids; y-axis: error; series: training data, held-out data]
[Figure: error is approximately cube root of k; x-axis: k; y-axis: error; series: actual, cube root model]
Moral For Auto-encoders
• The simplest auto-encoders can be good models
• For more complex spaces/signals, more elaborate models may
be required
• Consider deep learning, recurrent networks, denoising
Anomalies among sporadic events
Sporadic Web Traffic to an e-Business Site
It’s important to know if traffic is stopped or
delayed because of a problem…
But visits to site normally come at
varying intervals.
How long after the last event
should you begin to worry?
And how do you let your CEO
sleep through the night?
Basic idea:
The time interval between events is how you convert sporadic events into something useful you can measure
Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute times
• Counts don’t link as directly to probability models
• Time interval is log ρ
• This is a big deal
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
Converting Event Times to Anomaly
[Figure: event timing data with 99.9%-ile and 99.99%-ile threshold lines]
But in the real world, event
rates often change
Time Intervals Are Key to Modeling Sporadic Events
[Figure: interval dt (min) versus time t (days)]
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
$\Delta t \sim \lambda e^{-\lambda t}$
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
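The exponential tail gives the threshold directly when the rate is known. The sketch below solves exp(-λT) = p for T and checks the result against simulated intervals. The rate of 2 events per minute and the function names are assumptions made for illustration.

```python
import math
import random


def interval_threshold(rate, tail_probability):
    """Solve exp(-rate * T) = tail_probability for T."""
    return -math.log(tail_probability) / rate


def surprise(rate, interval):
    """-log P(dt > interval) for exponentially distributed intervals."""
    return rate * interval


rate = 2.0                                   # events per minute (assumed)
T = interval_threshold(rate, 0.001)          # 99.9%-ile of the interval

# Simulated intervals: roughly 0.1% of them should exceed T.
rng = random.Random(1)
intervals = [rng.expovariate(rate) for _ in range(100_000)]
observed_tail = sum(dt > T for dt in intervals) / len(intervals)
```

Because the surprise λT is linear in the waiting time, a threshold on the rate-scaled interval is equivalent to a threshold on -log p.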
After Rate Correction
[Figure: rate-corrected interval dt/rate versus t (days), with 99.9%-ile and 99.99%-ile lines]
Model-Scaled Intervals Solve the Problem
Model Delta Anomaly Detection
[Diagram: the model's output is subtracted from the input to give δ (labelled log p); an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm]
Detecting Anomalies in Sporadic Events
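[Diagram: incoming event times t_i pass through a delay of n events (Δn); a rate predictor fed by rate history supplies λ; δ = λ(t_i − t_{i−n}); a t-digest tracks the 99.97%-ile of δ as threshold t; if δ > t, raise an alarm]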
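The pipeline for sporadic events can be sketched as below, with several simplifying assumptions: the rate predictor is replaced by a known constant rate, an exact sorted-list quantile stands in for the t-digest, and the traffic (including a 10-minute outage) is simulated. The names `scaled_intervals`, `quantile`, and `make_events` are hypothetical.

```python
import random
from collections import deque


def scaled_intervals(event_times, rate_at, n):
    """Yield (t_i, delta) with delta = rate * (t_i - t_{i-n})."""
    recent = deque(maxlen=n)          # the n most recent event times
    for t in event_times:
        if len(recent) == n:
            yield t, rate_at(t) * (t - recent[0])
        recent.append(t)


def quantile(values, q):
    """Exact quantile; a t-digest would replace this in a streaming setting."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def make_events(rate, start, end, rng):
    """Simulate Poisson arrivals between start and end."""
    t, out = start, []
    while True:
        t += rng.expovariate(rate)
        if t >= end:
            return out
        out.append(t)


rng = random.Random(3)
rate = 5.0                                         # events per minute (assumed known)
history = make_events(rate, 0.0, 2000.0, rng)      # normal traffic for calibration
threshold = quantile(
    [d for _, d in scaled_intervals(history, lambda t: rate, n=3)], 0.9997
)

# Live traffic with a 10-minute outage starting at t = 100.
live = make_events(rate, 0.0, 100.0, rng) + make_events(rate, 110.0, 200.0, rng)
alarms = [t for t, d in scaled_intervals(live, lambda t: rate, n=3) if d > threshold]
```

In this sketch the alarm fires on the first events after the outage, because the scaled interval spanning the gap is far above anything seen during calibration.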
Slipped Week: Simple Rate Predictor
[Figure: main page traffic, Hits (x1000) by date from Nov 02 to Dec 02, with points A, B, C and D marked]
Seasonality Poses a Challenge
[Figure: Christmas traffic, Hits/1000 by date from Nov 17 to Dec 27]
Something more is needed …
[Figure: Christmas traffic, Hits/1000 by date from Nov 17 to Dec 27]
We need a better rate predictor…
[Diagram: the sporadic-event detector again: incoming events, Δn delay, rate predictor fed by rate history, δ = λ(t_i − t_{i−n}), t-digest 99.97%-ile threshold t, alarm when δ > t]
Idea: Predict log(rate) from lagged log(rate)
• Predict log because
– Peak to valley ratio
– Traffic grew by 30 %
– All rates are positive
– Just because I said so
• Let model see many lagged values
• Use L1 regularized linear model to pick important historical
values
– We would have moved to something fancier if this hadn’t worked
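The idea of predicting log(rate) from lagged log(rate) with an L1-regularized linear model can be sketched in plain Python. Everything concrete here is an assumption for illustration: the candidate lags, the synthetic hourly data with a daily cycle, the penalty `alpha`, and the small coordinate-descent `lasso` routine, which stands in for whatever solver was actually used.

```python
import math
import random


def soft_threshold(value, amount):
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def lasso(rows, targets, alpha, sweeps=100):
    """L1-regularised least squares by cyclic coordinate descent.

    Minimises mean((y - b - X w)^2) / 2 + alpha * sum(|w|).
    """
    m, d = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / m for j in range(d)]
    y_mean = sum(targets) / m
    X = [[r[j] - x_mean[j] for j in range(d)] for r in rows]   # centred features
    residual = [y - y_mean for y in targets]                   # y - X w with w = 0
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            z = sum(X[i][j] ** 2 for i in range(m)) / m
            if z == 0.0:
                continue
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(m)) / m
            new_w = soft_threshold(rho, alpha) / z
            if new_w != w[j]:
                for i in range(m):
                    residual[i] += (w[j] - new_w) * X[i][j]
                w[j] = new_w
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(d))
    return w, intercept


LAGS = [1, 2, 24, 48, 168]          # hours back; candidate lags are an assumption

rng = random.Random(11)
log_rate = [
    2.0 + math.sin(2 * math.pi * h / 24) + rng.gauss(0.0, 0.1)
    for h in range(24 * 28)         # four weeks of synthetic hourly data
]

start = max(LAGS)
rows = [[log_rate[h - lag] for lag in LAGS] for h in range(start, len(log_rate))]
targets = log_rate[start:]

weights, intercept = lasso(rows, targets, alpha=0.01)


def predict_rate(history):
    """Predicted rate for the next hour from a list of past log rates."""
    z = intercept + sum(w * history[-lag] for w, lag in zip(weights, LAGS))
    return math.exp(z)              # exponentiating keeps the rate positive
```

Working in log space means the prediction is always a positive rate, and the L1 penalty is what lets the model keep only the lags that matter.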
A New Rate Predictor for Sporadic Events
Improved Prediction with Adaptive Modeling
[Figure: Christmas prediction, Hits (x1000) by date from Dec 17 to Dec 29]
Some days the magic works
Some days ...
We use slightly different magic
Streaming Micro-Service Scenario
[Diagram: a file upload web service, thumbnail extraction and transcoding, linked by streams named uploads, thumbs and recodes, with files as input and output]
Let’s Assume a Good Micro-Architecture
[Diagram: thumbnail extraction with input stream uploads, output stream thumbs, and monitoring streams metrics, exceptions and checkpoints, plus a restart path]
How Can We Monitor This?
• We want to detect more than just cascading total failure
• Arrival time is good for detecting upstream complete loss
• What about misbehavior of a black box?
[Figure: delta time versus end time]
[Figure: end versus start]
[Figure: elapsed time versus end time]
Some Models Must Model Internal Operations
• Computational architecture can help modeling
• Models need to be built with knowledge of intent and structure
• Pure black box is rarely sufficient … you need some intuitions
Anomaly Detection + Classification → Useful Pair
• Use the AD model to detect anomalies in new data
– Methods such as clustering for discovery can be helpful
• Once you have well-defined models in your system, you may
also want to use classification to tag those
• Continue to use the AD model to find new anomalies
Recap (out of order)
• Anomaly detection is best done with a probability model
• -log p is a good way to convert to an anomaly measure
• -log p takes different forms in different systems
• Simplistic distributions insufficient in practice
– Need mixture distributions
– Resampled live data
• Adaptive quantile estimation (t-digest) works for auto-setting
thresholds
Recap
• Different systems require different models
• Continuous time-series
– sparse coding to build signal model
• Events in time
– rate model based on variable rate Poisson
– segregated rate model
• Events with labels
– language modeling
– hidden Markov models
Why Use Anomaly Detection?
Keep in mind…
• Model normal, then find anomalies
• t-digest for adaptive threshold
• Probabilistic models for complex patterns
[Figure: signal built as offset + noise + pulse1 + pulse2, with points A and B marked]
Keep in mind…
• Time intervals are key for sporadic events
• Complex time shift to predict rate with seasonality
• Sequence of events reveals phishing attack
[Figure: Christmas prediction, Hits (x1000) by date from Dec 17 to Dec 29]
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Thank you for coming today!
Find my slides & other related materials to this talk here: bit.ly/sdaml-june2016
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course
discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com

Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Anomaly Detection: How to find what you didn’t know to look for

  • 1. © 2016 MapR Technologies
  • 2. Anomaly Detection: How To Find What You Didn’t Know to Look For Ted Dunning, Chief Applications Architect MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Ellen Friedman, Consultant and Commentator Email ellenf@apache.org Twitter @Ellen_Friedman
  • 3. e-book available courtesy of MapR http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 4. Practical Machine Learning series (O’Reilly) • Machine learning is becoming mainstream • Need pragmatic approaches that take into account real world business settings: – Time to value – Limited resources – Availability of data – Expertise and cost of team to develop and to maintain system • Look for approaches with big benefits for the effort expended
  • 5. Anomaly Detection
  • 6. Who Needs Anomaly Detection? Utility providers using smart meters
  • 7. Who Needs Anomaly Detection? Feedback from manufacturing assembly lines
  • 8. Who Needs Anomaly Detection? Monitoring data traffic on communication networks
  • 9. What is Anomaly Detection? • The goal is to discover rare events – especially those that shouldn’t have happened • Find a problem before other people see it – especially before it causes a problem for customers • Why is this a challenge? – I don’t know what an anomaly looks like (yet)
  • 10. Spot the Anomaly
  • 11. Spot the Anomaly Looks pretty anomalous to me
  • 12. Spot the Anomaly Will the real anomaly please stand up?
  • 13. Basic idea: Find “normal” first
  • 14. Steps in Anomaly Detection • Build a model: Collect and process data for training a model • Use the machine learning model to determine what is the normal pattern • Decide how far away from this normal pattern you’ll consider to be anomalous • Use the AD model to detect anomalies in new data – Methods such as clustering for discovery can be helpful
  • 15. How hard is it to set an alert for anomalies? Grey data is from normal events; x’s are anomalies. Where would you set the threshold?
  • 16. Basic idea: Set adaptive thresholds
  • 17. What Are We Really Doing • We want action when something breaks (dies/falls over/otherwise gets in trouble) • But action is expensive • So we don’t want too many false alarms • And we don’t want too many false negatives • What’s the right threshold to set for alerts? – We need to trade off costs
  • 18. A Second Look
  • 19. A Second Look [Plot with the 99.9%-ile marked]
  • 20. Cool algorithm: t-digest
  • 21. How Hard Can it Be? [Diagram: each value x feeds an online summarizer that estimates the 99.9%-ile threshold t; an alarm is raised when x > t]
  • 22. Using t-Digest • The t-digest is an on-line percentile estimator – very high accuracy for extreme tails • t-digest also available everywhere – in Elasticsearch, in Solr – in streamlib (open source library on github) – in Mahout Math (open source library on github) – standalone (github and Maven Central) • Very handy for general distributions, few assumptions • For latency, exponential binning may be useful – See, for instance, HdrHistogram
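
The adaptive-threshold idea on slides 16 to 22 can be sketched in a few lines. This is a minimal illustration, not the t-digest itself: the class below (a hypothetical name, QuantileThreshold) keeps every value in a sorted list and reads off an exact percentile, whereas a real t-digest keeps a small fixed-size summary. The warm-up length and the simulated Gaussian data are assumptions made for the example.

```python
import bisect
import math
import random


class QuantileThreshold:
    """Exact running quantile used as an alarm threshold.

    Stand-in for a t-digest: it plays the same role (an online
    percentile estimate) but stores every value it has seen.
    """

    def __init__(self, q=0.999, warmup=1000):
        self.q = q
        self.warmup = warmup      # no alarms until this many values are seen
        self.values = []          # kept sorted by bisect.insort

    def threshold(self):
        # Nearest-rank quantile of everything observed so far.
        rank = max(1, math.ceil(self.q * len(self.values)))
        return self.values[rank - 1]

    def observe(self, x):
        """Return True if x exceeds the current threshold, then record x."""
        alarm = len(self.values) >= self.warmup and x > self.threshold()
        bisect.insort(self.values, x)
        return alarm


# Simulated "normal" measurements, followed by one extreme value.
rng = random.Random(42)
detector = QuantileThreshold(q=0.999, warmup=1000)
for _ in range(5000):
    detector.observe(rng.gauss(0, 1))
alarm_on_outlier = detector.observe(10.0)
alarm_on_typical = detector.observe(0.0)
```

Because the threshold is a percentile of the data itself, roughly 0.1% of ordinary values will still trigger an alarm; the percentile is the knob that trades false alarms against missed anomalies.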
  • 23. So are we all done?
  • 24. What About This? [Plot: offset+noise+pulse1+pulse2, with points A and B marked]
  • 25. Model Delta Anomaly Detection [Diagram: the model's prediction is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t]
  • 26. Spot the Anomaly Anomaly?
  • 27. Maybe not!
  • 28. Where’s Waldo? This is the real anomaly
  • 29. Normal Isn’t Just Normal • What we want is a model of what is normal • What doesn’t fit the model is the anomaly • For simple signals, the model can be simple … $x \sim m(t) + N(0, \epsilon)$ • The real world is rarely so accommodating
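
A small sketch of the model-delta detector for the simple case where x is a known function of time plus Gaussian noise. The sine model, the noise level, the sample counts and the injected fault are all assumptions chosen for illustration; the exact quantile again stands in for a t-digest. The fault is placed at a peak of the signal so that the raw value stays inside the normal range and only the residual gives it away.

```python
import math
import random


def model(t):
    # Hypothetical "normal" behaviour: a sine wave with a 50-sample period.
    return math.sin(2 * math.pi * t / 50)


def deltas(xs, m=model):
    """Squared residual between each observation and the model's prediction."""
    return [(x - m(t)) ** 2 for t, x in enumerate(xs)]


def quantile(values, q):
    """Nearest-rank quantile (exact; a t-digest would estimate this online)."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered))) - 1]


rng = random.Random(0)

# Learn the threshold from residuals of normal data.
normal = [model(t) + rng.gauss(0, 0.1) for t in range(2000)]
t_threshold = quantile(deltas(normal), 0.999)

# Fresh data with one sample pulled off the model near a peak (t = 112).
fresh = [model(t) + rng.gauss(0, 0.1) for t in range(200)]
fresh[112] -= 1.0
alarms = [t for t, d in enumerate(deltas(fresh)) if d > t_threshold]
```

A threshold on the raw values would miss this sample, since a value near zero is entirely ordinary for a sine wave; the residual is what is unusual.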
  • 30–44. We Do Windows (15 consecutive slides with the same title)
  • 45. Windows on the World • The set of windowed signals is a nice model of our original signal • Clustering can find the prototypes – Fancier techniques available using sparse coding • The result is a dictionary of shapes • New signals can be encoded by shifting, scaling and adding shapes from the dictionary
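
The windowing and clustering steps can be sketched as follows. This is a deliberately reduced version under stated assumptions: it uses plain Lloyd's k-means seeded with the first k windows (chosen only to keep the example deterministic), and it reconstructs each window by its single nearest shape, without the shifting, scaling and adding described on the slide. The synthetic sine signal, window width and flat-line fault are illustrative choices, not taken from the talk.

```python
import math
import random


def windows(signal, width, step):
    """Cut the signal into fixed-width windows (overlapping when step < width)."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(w, centroids):
    return min(range(len(centroids)), key=lambda j: dist2(w, centroids[j]))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm seeded with the first k points; returns the shape dictionary."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def reconstruction_error(signal, centroids, width):
    """RMS error of each non-overlapping window against its nearest shape."""
    errs = []
    for w in windows(signal, width, width):
        c = centroids[nearest(w, centroids)]
        errs.append(math.sqrt(dist2(w, c) / width))
    return errs


rng = random.Random(1)


def wave(n):
    return [math.sin(2 * math.pi * t / 32) + rng.gauss(0, 0.05) for t in range(n)]


shapes = kmeans(windows(wave(4096), 16, 4), k=8)

test_signal = wave(1024)
for t in range(512, 528):
    test_signal[t] = 0.0          # flat-line fault filling window number 32
errors = reconstruction_error(test_signal, shapes, 16)
worst = max(range(len(errors)), key=errors.__getitem__)
```

The per-window error is a one-dimensional signal, so the same percentile threshold used earlier applies to it directly.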
  • 46. Most Common Shapes (for EKG)
  • 47. [Plot: original signal and reconstructed signal] Reconstruction error < 1 bit / sample
  • 48. An Anomaly Original technique for finding 1-d anomaly works against reconstruction error
  • 49. Close-up of anomaly Not what you want your heart to do. And not what the model expects it to do.
  • 50. A Different Kind of Anomaly
  • 51. Model Delta Anomaly Detection [Diagram: the model's prediction is subtracted from the input to give δ; an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t]
  • 52. The Real Inside Scoop • The model-delta anomaly detector is really just a sum of random variables – the model we know about already – and a normally distributed error • The output (delta) is (roughly) the log probability of the sum distribution (really $\delta^2$) • Thinking about probability distributions is good
  • 53. Some k-means Caveats • But Eamonn Keogh says that k-means can’t work on time-series • That is silly … and kind of correct, k-means does have limits – Other kinds of auto-encoders are much more powerful • More fun and code demos at – https://github.com/tdunning/k-means-auto-encoder http://www.cs.ucr.edu/~eamonn/meaningless.pdf
  • 54. The Limits of Clustering as Auto-encoder • Clustering is like trying to tile your sample distribution • Can be used to approximate a signal • Filling d dimensional region with k clusters should give $\epsilon \approx 1/\sqrt[d]{k}$ • If d is large, this is no good
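
A short worked example of why large d is a problem. It reads the garbled formula on slide 54 as the error falling with the d-th root of the number of clusters, which matches the cube-root model shown on a later slide; constant factors are dropped and the target error of 0.1 is an arbitrary choice.

```latex
\epsilon \approx k^{-1/d}
\quad\Longrightarrow\quad
k \approx \epsilon^{-d}

% Target error \epsilon = 0.1:
%   d = 3  \;\Rightarrow\; k \approx 10^{3}
%   d = 10 \;\Rightarrow\; k \approx 10^{10}
```

Halving the error multiplies the number of clusters needed by $2^d$, so each extra dimension doubles the cost of the same improvement.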
  • 55. [Plot: Time series training data (first 2000 samples); x-axis: Time; series: test data, reconstruction, error]
  • 56. [Plot: Reconstruction error for time-series data; x-axis: Centroids; y-axis: MAV error; series: training data, held-out data]
  • 57. Another Example • Take points randomly in , project non-linearly into • Approximation using clustering should give
  • 58. [Plot: Reconstruction error for random points; x-axis: Centroids; y-axis: Error; series: training data, held-out data]
  • 59. [Plot: Error is approximately cube root of k; x-axis: k; y-axis: Error; series: actual, cube root model]
  • 60. Moral For Auto-encoders • The simplest auto-encoders can be good models • For more complex spaces/signals, more elaborate models may be required • Consider deep learning, recurrent networks, denoising
  • 61. Anomalies among sporadic events
  • 62. Sporadic Web Traffic to an e-Business Site It’s important to know if traffic is stopped or delayed because of a problem… But visits to site normally come at varying intervals. How long after the last event should you begin to worry?
  • 63. Sporadic Web Traffic to an e-Business Site It’s important to know if traffic is stopped or delayed because of a problem… But visits to site normally come at varying intervals. And how do you let your CEO sleep through the night?
  • 64. Basic idea: Time interval between events is how to convert to something useful you can measure
  • 65. Sporadic Events: Finding Normal and Anomalous Patterns • Time between events is much more usable than absolute times • Counts don’t link as directly to probability models • Time interval is log ρ • This is a big deal
  • 66. Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values – This shows up as a change in interval • Want alert as soon as possible
  • 67. Converting Event Times to Anomaly [Plot with 99.9%-ile and 99.99%-ile thresholds marked]
  • 68. But in the real world, event rates often change
  • 69. Time Intervals Are Key to Modeling Sporadic Events [Plot; x-axis: t (days); y-axis: dt (min)]
  • 70. Time Intervals Are Key to Modeling Sporadic Events [Plot; x-axis: t (days); y-axis: dt (min)]
  • 71. Poisson Distribution • Time between events is exponentially distributed $\Delta t \sim \lambda e^{-\lambda t}$ • This means that long delays are exponentially rare $P(\Delta t > T) = e^{-\lambda T}$, so $-\log P(\Delta t > T) = \lambda T$ • If we know λ we can select a good threshold – or we can pick a threshold empirically
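
The relation between λ and the threshold can be made concrete. Solving P(Δt > T) = e^(−λT) = p for T gives T = −ln(p)/λ. The function names and the rate of 2 events per minute below are assumptions for the example.

```python
import math


def interval_threshold(rate, p):
    """Gap length T such that P(gap > T) = p for a Poisson process with this rate."""
    return -math.log(p) / rate


def surprise(gap, rate):
    """-log P(gap > observed gap), which for exponential gaps is rate * gap."""
    return rate * gap


rate = 2.0                                  # hypothetical: 2 events per minute
T = interval_threshold(rate, p=0.001)       # about 3.45 minutes
tail_probability = math.exp(-surprise(T, rate))
```

A gap longer than T minutes should occur for only about one interval in a thousand while the rate really is 2 per minute, so it is a reasonable point at which to start worrying.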
  • 72. After Rate Correction [Plot; x-axis: t (days); y-axis: dt/rate; 99.9%-ile and 99.99%-ile thresholds marked]
  • 73. Model-Scaled Intervals Solve the Problem
  • 74. Model Delta Anomaly Detection [Diagram: the model's prediction is subtracted from the input to give δ (log p); an online summarizer estimates the 99.9%-ile threshold t; an alarm is raised when δ > t]
  • 75. Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i pass through an n-event delay (Δn); a rate predictor fed by rate history supplies λ; the scaled interval is δ = λ(t_i − t_{i−n}); a t-digest supplies the 99.97%-ile threshold t; an alarm is raised when δ > t]
  • 76. Detecting Anomalies in Sporadic Events [Same diagram as slide 75]
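
The detector in the diagram can be sketched as a small class. The class name, the fixed threshold and the constant rate are assumptions: in the slides the threshold comes from a t-digest (99.97%-ile) and λ from a rate predictor. If the predicted rate is right and events are Poisson, δ = λ(t_i − t_{i−n}) follows a Gamma(n, 1) distribution with mean n, whatever the rate happens to be.

```python
from collections import deque


class SporadicEventDetector:
    """Alarm on rate-scaled spans of n consecutive inter-event intervals."""

    def __init__(self, n, threshold):
        self.n = n
        self.threshold = threshold           # fixed here; a t-digest in the slides
        self.times = deque(maxlen=n + 1)     # holds t_{i-n} .. t_i

    def observe(self, t, rate):
        """Record an event at time t; rate is the predicted events per unit time."""
        self.times.append(t)
        if len(self.times) <= self.n:
            return None                      # not enough history yet
        delta = rate * (self.times[-1] - self.times[0])
        return delta, delta > self.threshold

    def pending(self, now, rate):
        """Lower bound on the next delta if no event has arrived by `now`."""
        if len(self.times) < self.n:
            return None
        return rate * (now - self.times[-self.n])


det = SporadicEventDetector(n=3, threshold=8.0)
results = [det.observe(float(t), rate=1.0) for t in range(6)]   # events at 0..5
silent_check = det.pending(now=12.0, rate=1.0)    # no event since t = 5
late_event = det.observe(15.0, rate=1.0)
```

The pending check matters in practice: if traffic stops completely, no event arrives to trigger observe, so the open-ended gap has to be tested on a timer.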
  • 77. Slipped Week: Simple Rate Predictor [Plot: Main Page Traffic; x-axis: Date (Nov 02 to Dec 02); y-axis: Hits (x1000); points A, B, C, D marked]
  • 78. Seasonality Poses a Challenge [Plot: Christmas Traffic; x-axis: Date (Nov 17 to Dec 27); y-axis: Hits/1000]
  • 79. Something more is needed … [Plot: Christmas Traffic; x-axis: Date (Nov 17 to Dec 27); y-axis: Hits/1000]
  • 80. We need a better rate predictor… [Same diagram as slide 75]
  • 81–83. Idea: Predict log(rate) from lagged log(rate) • Predict log because – Peak to valley ratio – Traffic grew by 30 % – All rates are positive – Just because I said so • Let model see many lagged values • Use L1 regularized linear model to pick important historical values – We would have moved to something fancier if this hadn’t worked
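
One way to write down the predictor described here, as a sketch only. The slides do not give the lag set or the penalty weight, so the symbols below are assumptions: $r_t$ is the observed rate in period $t$, $L$ is the set of candidate lags, $w_j$ and $b$ are the fitted coefficients, and $\alpha$ is the L1 penalty strength.

```latex
\log \hat{r}_t = b + \sum_{j \in L} w_j \,\log r_{t-j}

\min_{b,\,w}\;
\sum_t \bigl(\log r_t - \log \hat{r}_t\bigr)^2
\;+\; \alpha \sum_{j \in L} \lvert w_j \rvert
```

The L1 term drives most $w_j$ to exactly zero, which is how the model picks out the few historical values that matter (for example, the same hour on previous days) from the many lags it is shown.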
  • 84. A New Rate Predictor for Sporadic Events
  • 85. Improved Prediction with Adaptive Modeling [Plot: Christmas Prediction; x-axis: Date (Dec 17 to Dec 29); y-axis: Hits (x1000)]
  • 86. Some days the magic works Some days ... We use slightly different magic
  • 87. Streaming Micro-Service Scenario [Diagram with components: File upload web service, Thumbnail extraction, Transcoding; streams: uploads, thumbs, recodes; Files]
  • 88. Let’s Assume a Good Micro-Architecture [Diagram: Thumbnail extraction service; labels: Input, Output, Monitoring, Restart; streams: uploads, thumbs, metrics, exceptions, checkpoints]
  • 89. How Can We Monitor This? • We want to detect more than just cascading total failure • Arrival time is good for detecting upstream complete loss • What about misbehavior of a black box?
  • 90. [Plot; x-axis: End time; y-axis: Delta time]
  • 91. [Plot; x-axis: Start; y-axis: End]
  • 92. [Plot; x-axis: End time; y-axis: Elapsed time]
  • 93. Some Models Must Model Internal Operations • Computational architecture can help modeling • Models need to be built with knowledge of intent and structure • Pure black box is rarely sufficient … you need some intuitions
  • 94. Anomaly Detection + Classification → Useful Pair • Use the AD model to detect anomalies in new data – Methods such as clustering for discovery can be helpful • Once you have well-defined models in your system, you may also want to use classification to tag those • Continue to use the AD model to find new anomalies
  • 95. Recap (out of order) • Anomaly detection is best done with a probability model • -log p is a good way to convert to anomaly measure • -log p takes different forms in different systems • Simplistic distributions insufficient in practice – Need mixture distributions – Resampled live data • Adaptive quantile estimation (t-digest) works for auto-setting thresholds
  • 96. Recap • Different systems require different models • Continuous time-series – sparse coding to build signal model • Events in time – rate model based on variable rate Poisson – segregated rate model • Events with labels – language modeling – hidden Markov models
  • 97. Why Use Anomaly Detection?
  • 98. Keep in mind… • Model normal, then find anomalies • t-digest for adaptive threshold • Probabilistic models for complex patterns [Plot: offset+noise+pulse1+pulse2, with points A and B marked]
  • 99. Keep in mind… • Time intervals are key for sporadic events • Complex time shift to predict rate with seasonality • Sequence of events reveals phishing attack [Plot: Christmas Prediction; x-axis: Date (Dec 17 to Dec 29); y-axis: Hits (x1000)]
  • 100. e-book available courtesy of MapR http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 101. 6 Free ebooks Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams
  • 102. Thank you for coming today!
  • 103. Find my slides & other related materials to this talk here: bit.ly/sdaml-june2016
  • 104. …helping you put data technology to work ● Find answers ● Ask technical questions ● Join on-demand training course discussions ● Follow release announcements ● Share and vote on product ideas ● Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com