© 2017 MapR Technologies 1
Detecting Change
© 2017 MapR Technologies 2
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @ted_dunning
© 2017 MapR Technologies 3
Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
© 2017 MapR Technologies 4
Basic Outline
• Goal Setting
• Basic Ideas
– LLR (finding changes in counts)
– Poisson rate change detection (finding changes in events timing)
– Distribution estimation / visualization
– Labeled events and adding labels
• Free Improvisation on Themes
© 2017 MapR Technologies 7
Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
© 2017 MapR Technologies 10
The Second Student
• Another student said to the master, “I see something has
changed … something may have broken”
• The master replied, “You have no question to ask. You have no
need of enlightenment”
• And thus the student was enlightened
© 2017 MapR Technologies 11
• There are some very powerful techniques available, some only
very recently, that can make the detection of change much
easier than you might think. I will describe the practical use of
several of these techniques including t-digest, non-linear
histograms, variable rate Poisson models and combinations of
these.
© 2017 MapR Technologies 12
Comparing Counts
• Suppose we have two situations A and B, each with many
observations, nA and nB
• And some event x occurred n1A and n1B times in each situation
        x       other
A       n1A     nA - n1A
B       n1B     nB - n1B
© 2017 MapR Technologies 13
Comparing Counts
• Have we seen a change in the frequency of x?
• Frequency ratios?
– Breaks with small counts
• χ² test?
– Breaks with small counts
© 2017 MapR Technologies 14
Log-Likelihood Ratio Test (Root LLR)
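• In R
# Unnormalized Shannon entropy of a vector or matrix of counts: -sum(k * log(k / N)).
# The (k == 0) term turns log(0) into log(1) = 0 for empty cells.
entropy = function(k) {
  -sum(k * log((k == 0) + (k / sum(k))))
}
# LLR statistic for a contingency table k (a matrix of counts)
llr = function(k) {
  2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))
}
• This is 2 N times the mutual information between rows and columns (N is the total count)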
© 2017 MapR Technologies 15
Spot the Anomaly
• Root LLR is roughly like standard deviations
            A       not A
B           13      1000
not B       1000    100,000
Root LLR: 0.89

            A       not A
B           1       0
not B       0       2
Root LLR: 1.95

            A       not A
B           1       0
not B       0       10,000
Root LLR: 4.51

            A       not A
B           10      0
not B       0       100,000
Root LLR: 14.29
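The scores on this slide can be checked with a short Python port of the R functions from the previous slide. This is a sketch, not the author's code; the name `root_llr` is introduced here for illustration and simply takes the square root of the LLR statistic.

```python
import math

def entropy(counts):
    """Unnormalized Shannon entropy: -sum(k * log(k / N)), treating 0 * log(0) as 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(table):
    """LLR statistic for a contingency table given as a list of rows of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    cells = [k for row in table for k in row]
    return 2.0 * (entropy(row_sums) + entropy(col_sums) - entropy(cells))

def root_llr(table):
    """Square root of the LLR; max() guards against tiny negative rounding errors."""
    return math.sqrt(max(0.0, llr(table)))

# The four tables from the slide, in the order shown.
tables = [
    [[13, 1000], [1000, 100000]],
    [[1, 0], [0, 2]],
    [[1, 0], [0, 10000]],
    [[10, 0], [0, 100000]],
]
scores = [root_llr(t) for t in tables]
```

Only the last two tables score well above 2 or 3 "standard deviations", even though the second table looks like a perfect association at first glance.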
© 2017 MapR Technologies 16
How Does it Work
Empirical fit to asymptotic
distribution is very good
© 2017 MapR Technologies 17
How Does it Work?
© 2017 MapR Technologies 18
OK
We can detect changes in counts
© 2017 MapR Technologies 19
Real-life Example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
© 2017 MapR Technologies 20
Real-life Example
© 2017 MapR Technologies 21
Example 2 - Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data
© 2017 MapR Technologies 22
Example 2 - Common Point of Compromise
[Diagram: card data is stolen (skimmed) from Merchant 0; the skimmed data is then used in frauds (exploit) at other merchants, up to Merchant n.]
© 2017 MapR Technologies 23
Simulation Setup
[Figure: daily counts of compromises and frauds. x axis: day (0 to 100); y axis: count (0 to 500). The compromise period and the exploit period are marked.]
© 2017 MapR Technologies 24
Detection Strategy
• Select histories that precede non-fraud
• And histories that precede fraud detection
• Analyze 2x2 cooccurrence of merchant n versus fraud
detection
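The strategy above can be sketched as follows, assuming each card history is reduced to the set of merchants visited and that histories are already split into those preceding a fraud detection and those that do not. The function names and the toy data are hypothetical; the sign convention (positive when a merchant is over-represented before fraud) is also an assumption made here.

```python
import math

def entropy(counts):
    """Unnormalized Shannon entropy with 0 * log(0) treated as 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def signed_root_llr(k11, k12, k21, k22):
    """Root LLR of a 2x2 table, signed by the direction of the association."""
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    g2 = 2.0 * (entropy(rows) + entropy(cols) - entropy([k11, k12, k21, k22]))
    root = math.sqrt(max(0.0, g2))
    # Positive when the merchant appears more often in fraud histories.
    return root if k11 * (k21 + k22) > k21 * (k11 + k12) else -root

def score_merchants(fraud_histories, clean_histories):
    """Score every merchant by its 2x2 cooccurrence with later fraud."""
    merchants = set().union(*fraud_histories, *clean_histories)
    scores = {}
    for m in merchants:
        k11 = sum(m in h for h in fraud_histories)   # merchant seen, fraud followed
        k21 = sum(m in h for h in clean_histories)   # merchant seen, no fraud
        scores[m] = signed_root_llr(
            k11, len(fraud_histories) - k11,
            k21, len(clean_histories) - k21,
        )
    return scores

# Toy data (hypothetical): merchant 0 shows up in most fraud histories.
fraud_histories = [{0, 1}, {0, 2}, {0, 3}, {0, 1}, {2}]
clean_histories = [{1, 2}, {2, 3}, {1, 3}, {1}, {3}, {2}, {1, 2, 3}, {0}]
scores = score_merchants(fraud_histories, clean_histories)
suspect = max(scores, key=scores.get)
```

Because only merchant identifiers and fraud flags are needed, a screen like this can be run on simulated data without exposing account details.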
© 2017 MapR Technologies 25
© 2017 MapR Technologies 26
What about the
real world?
© 2017 MapR Technologies 27
[Figure: LLR score for real data. x axis: number of merchants (log scale, 10^0 to 10^6); y axis: BreachScore (LLR), 0 to 80. The highest-scoring merchants are labeled “Really truly bad guys”.]
© 2017 MapR Technologies 28
What about time?
© 2017 MapR Technologies 29
Finding Changes in Timing
• Suppose our input is events embedded in time
• Suppose we want to find changes in our input in real-time
• Waiting and counting is fine if we don’t have to react now
• We can do much better
© 2017 MapR Technologies 30
Poisson Event Rate Change
• Detection of fallout
– Time since last is very sensitive for complete failure
• Detection of change relative to reference
– Time since n-th most recent
– LLR with time
• Have to trade detection speed versus false positive rate and
size of change
• Can run multiple detectors at once
© 2017 MapR Technologies 31
Basic idea:
Time interval is better than counts
© 2017 MapR Technologies 32
Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute times
• Counts don’t link as directly to probability models
• Time interval is proportional to −log p
• This is a big deal
© 2017 MapR Technologies 33
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
© 2017 MapR Technologies 34
Converting Event Times to Anomaly
[Figure: event times converted to anomaly scores, with the 99.9%-ile and 99.99%-ile thresholds marked.]
© 2017 MapR Technologies 35
In the real world,
event rates often vary
© 2017 MapR Technologies 36
Time Intervals Are Key to Modeling Sporadic Events
[Figure: interval between events, dt (min, 0 to 8), plotted against time t (days, 0 to 4).]
© 2017 MapR Technologies 38
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
$\Delta t \sim \lambda e^{-\lambda t}$
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
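The relations on this slide give a threshold directly once λ is known. A minimal sketch, with function names chosen here for illustration:

```python
import math

def interval_threshold(rate, false_alarm_prob):
    """Gap T such that P(dt > T) = false_alarm_prob for a Poisson process with this rate."""
    return -math.log(false_alarm_prob) / rate

def surprise(rate, dt):
    """-log P(dt' > dt): the anomaly score of a gap, linear in the gap length."""
    return rate * dt

# Example: 2 events per minute, one false alarm per 1000 gaps.
T = interval_threshold(2.0, 0.001)   # gap length in minutes
score_at_T = surprise(2.0, T)        # equals -log(0.001)
```

The score grows linearly with the waiting time, which is why an interval can be read directly as a negative log probability.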
© 2017 MapR Technologies 39
After Rate Correction
[Figure: rate-corrected intervals (dt/rate, 0 to 10) plotted against t (days, 0 to 4), with the 99.9%-ile and 99.99%-ile thresholds marked.]
© 2017 MapR Technologies 40
Detecting Anomalies in Sporadic Events
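[Diagram: incoming events with times t_i feed an n-th order difference (Δn). The difference is scaled by the rate λ from a rate predictor driven by rate history, giving δ = λ(t_i − t_{i−n}). A t-digest of δ supplies the 99.97%-ile as threshold t, and an alarm is raised when δ > t.]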
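The pipeline on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the rate predictor is any callable `rate_fn(t)` supplied by the caller, a sorted list stands in for the t-digest, and the class name, `warmup` parameter and return convention are hypothetical.

```python
import bisect
from collections import deque

class NthOrderDetector:
    def __init__(self, n, rate_fn, quantile=0.9997, warmup=100):
        self.n = n
        self.rate_fn = rate_fn              # assumed rate predictor: t -> lambda
        self.quantile = quantile
        self.warmup = warmup                # scores needed before alarms are allowed
        self.times = deque(maxlen=n + 1)    # holds t_{i-n} ... t_i
        self.scores = []                    # sorted past deltas (stand-in for t-digest)

    def observe(self, t):
        """Feed one event time; returns (delta, alarm)."""
        self.times.append(t)
        if len(self.times) <= self.n:
            return None, False              # not enough history for an n-th order gap
        delta = self.rate_fn(t) * (t - self.times[0])
        alarm = False
        if len(self.scores) >= self.warmup:
            idx = min(len(self.scores) - 1, int(self.quantile * len(self.scores)))
            alarm = delta > self.scores[idx]
        bisect.insort(self.scores, delta)
        return delta, alarm

# Example: steady events one time unit apart, then a gap of eleven units.
detector = NthOrderDetector(n=1, rate_fn=lambda t: 1.0, warmup=50)
regular_alarms = [detector.observe(float(t))[1] for t in range(300)]
gap_delta, gap_alarm = detector.observe(310.0)
```

This version only evaluates δ when an event arrives. To catch a complete outage, the same comparison would also have to be run on a timer using the current time in place of t_i.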
© 2017 MapR Technologies 42
Seasonality Poses a Challenge
[Figure: Christmas Traffic. x axis: date (Nov 17 to Dec 27); y axis: hits/1000 (0 to 8).]
© 2017 MapR Technologies 43
Something more is needed …
[Figure: Christmas Traffic. x axis: date (Nov 17 to Dec 27); y axis: hits/1000 (0 to 8).]
© 2017 MapR Technologies 44
We need a better rate predictor…
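[Diagram: the anomaly detection pipeline: incoming events, n-th order difference Δn, rate predictor fed by rate history, δ = λ(t_i − t_{i−n}), t-digest threshold t at the 99.97%-ile, alarm when δ > t.]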
© 2017 MapR Technologies 47
Idea: Predict log(rate) from lagged log(rate)
• Predict log because
– Seasonal swings are ratios (peak to valley)
– Growth is multiplicative (traffic grew by 30%)
– All rates are positive
– Just because I said so
• Let model see many lagged values
• Use L1 regularized linear model to pick important historical
values
– We would have moved to something fancier if this hadn’t worked
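One way to sketch this idea without any libraries is a small Lasso solver based on coordinate descent. The solver, the lag set, the penalty `alpha` and the synthetic hourly series are all assumptions made for illustration; the slides do not say how the model was fitted.

```python
import math

def soft_threshold(x, a):
    """Shrink x toward zero by a; this is what sets small weights exactly to 0."""
    if x > a:
        return x - a
    if x < -a:
        return x + a
    return 0.0

def fit_lagged_lasso(log_rate, lags, alpha=0.01, sweeps=50):
    """Fit log_rate[t] ~ b + sum_j w[j] * log_rate[t - lags[j]], minimizing
    (1 / 2n) * sum(residual^2) + alpha * sum(|w|) by coordinate descent."""
    start = max(lags)
    y = log_rate[start:]
    n = len(y)
    cols = [[log_rate[t - lag] for t in range(start, len(log_rate))] for lag in lags]
    y_mean = sum(y) / n
    col_means = [sum(c) / n for c in cols]
    resid = [v - y_mean for v in y]                       # residual with all w = 0
    xc = [[v - m for v in c] for c, m in zip(cols, col_means)]
    z = [sum(v * v for v in c) / n for c in xc]
    w = [0.0] * len(lags)
    for _ in range(sweeps):
        for j, c in enumerate(xc):
            if z[j] == 0.0:
                continue
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + w[j] * z[j]
            new_w = soft_threshold(rho, alpha) / z[j]
            if new_w != w[j]:
                step = new_w - w[j]
                resid = [ri - step * ci for ri, ci in zip(resid, c)]
                w[j] = new_w
    b = y_mean - sum(wj * m for wj, m in zip(w, col_means))
    return w, b

def predict_next(history, lags, w, b):
    """Predict the next log(rate) from the most recent lagged values."""
    return b + sum(wj * history[-lag] for wj, lag in zip(w, lags))

# Synthetic hourly log(rate) with a daily cycle (hypothetical data).
series = [5.0 + math.sin(2 * math.pi * t / 24) for t in range(288)]
lags = list(range(1, 25))
w, b = fit_lagged_lasso(series, lags)
errors = [series[t] - predict_next(series[:t], lags, w, b) for t in range(24, 288)]
mse = sum(e * e for e in errors) / len(errors)
```

Exponentiating the prediction gives the rate λ to use when scaling intervals; working in log space keeps that rate positive by construction.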
© 2017 MapR Technologies 48
A New Rate Predictor for Sporadic Events
© 2017 MapR Technologies 49
Improved Prediction with Adaptive Modeling
[Figure: Christmas Prediction. x axis: date (Dec 17 to Dec 29); y axis: hits (x1000), 0 to 8.]
© 2017 MapR Technologies 50
Some days the magic works
Some days ...
We use slightly different magic
© 2017 MapR Technologies 51
Detecting More Subtle Changes
• Time-since-last finds complete failures well
• Nth order time finds more subtle rate changes
• But that subtlety delays detection of complete failure
– First order delay has 99.9% confidence at 6.5 units
– 10th order delay has 99.9% confidence at 12.5 units
• But 10th order delay can find speedups, first order cannot
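The trade-off above can be explored numerically. The sketch below assumes a constant-rate Poisson process and measures time in units of the mean inter-event interval, so the time spanned by n events follows an Erlang distribution; the units used for the figures quoted on the slide may be defined differently, so the code makes no attempt to reproduce them. Function names are hypothetical.

```python
import math

def erlang_sf(n, x):
    """P(S_n > x), where S_n is the time spanned by n unit-rate Poisson intervals."""
    term, total = 1.0, 0.0
    for k in range(n):
        total += term            # adds x^k / k!
        term *= x / (k + 1)
    return math.exp(-x) * total

def slowdown_threshold(n, p, hi=1000.0):
    """Delay x at which P(S_n > x) drops to p (found by bisection)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if erlang_sf(n, mid) > p:
            lo = mid
        else:
            hi = mid
    return hi

def speedup_threshold(n, p, hi=1000.0):
    """Delay x below which P(S_n <= x) is at most p (found by bisection)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 1.0 - erlang_sf(n, mid) < p:
            lo = mid
        else:
            hi = mid
    return lo

slow_1 = slowdown_threshold(1, 0.001)
slow_10 = slowdown_threshold(10, 0.001)
fast_1 = speedup_threshold(1, 0.001)
fast_10 = speedup_threshold(10, 0.001)
```

Comparing `fast_1` with `fast_10` shows why the first-order detector cannot see speedups: a single interval has to be almost zero to be surprising, whereas ten events arriving in a small multiple of the mean interval already are.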
© 2017 MapR Technologies 57
10th order difference of
Poisson distribution
© 2017 MapR Technologies 58
Finding Changes in Time Series
• So far, we only have times
• What about when we have times and measurements together?
– These are called time-series!
• First step can be to discretize the measurement
– Quintiles or deciles are good candidates
– Multi-scale discretization is a fine thing to do
• That gives us arrival times for measurements in each bin
– And these arrival times can be handled by the rate model on previous slides
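The discretization step can be sketched as below, assuming the time series is a list of `(time, value)` pairs and that bin edges are taken from a reference sample. The function names and the toy series are hypothetical.

```python
import bisect

def quantile_edges(values, bins):
    """Interior bin edges that split the reference values into equal-count bins."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // bins] for i in range(1, bins)]

def arrival_times_by_bin(series, edges):
    """Map each measurement to a bin and collect the arrival times per bin."""
    arrivals = {b: [] for b in range(len(edges) + 1)}
    for t, value in series:
        arrivals[bisect.bisect_right(edges, value)].append(t)
    return arrivals

# Toy series: 100 readings whose values are a shuffled 0..99.
series = [(float(t), float((t * 37) % 100)) for t in range(100)]
edges = quantile_edges([v for _, v in series], bins=5)   # quintiles
arrivals = arrival_times_by_bin(series, edges)
```

Each list in `arrivals` is now a plain event stream, so the interval-based detectors described earlier apply to it unchanged.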
© 2017 MapR Technologies 59
Finding Changes in Time Series
• Comprehensive approaches also possible (for counts)
• Time aware variant of G-test is possible
Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993)
http://bit.ly/surprise-and-coincidence
© 2017 MapR Technologies 60
Propagation Anomalies
• What happens when something shadows part of the coverage
field for mobile telecom?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
© 2017 MapR Technologies 61
© 2017 MapR Technologies 62
© 2017 MapR Technologies 63
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.
© 2017 MapR Technologies 64
Other Issues
• Finding changes in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
© 2017 MapR Technologies 65
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
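The last three bullets can be sketched as follows. The report-rate score uses the Poisson relation from earlier in the deck, and the signal-strength score uses an empirical percentile with add-one smoothing; that smoothing, the function names and the example numbers are assumptions made here.

```python
import bisect
import math

def rate_anomaly(rate, dt):
    """-log P(gap > dt) for reports arriving as a Poisson process at this rate."""
    return rate * dt

def signal_anomaly(sorted_history, value):
    """-log of the empirical probability of seeing a signal this weak or weaker."""
    rank = bisect.bisect_right(sorted_history, value)
    p = (rank + 1) / (len(sorted_history) + 1)   # add-one smoothing avoids log(0)
    return -math.log(p)

def overall_anomaly(scores):
    """Individual -log(p) scores simply add."""
    return sum(scores)

# One cluster: reports normally arrive twice per minute, the last gap was 3 minutes,
# and the newest signal reading is weaker than anything in the 99-sample history.
history = [float(v) for v in range(1, 100)]
weak = signal_anomaly(history, 0.5)
typical = signal_anomaly(history, 50.0)
total = overall_anomaly([rate_anomaly(2.0, 3.0), weak])
```

Adding the scores amounts to multiplying the probabilities, which treats the clusters as independent.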
© 2017 MapR Technologies 66
Tower Coverage Areas
© 2017 MapR Technologies 67
Just One Tower
© 2017 MapR Technologies 68
Cluster Reports for That Tower
© 2017 MapR Technologies 69
Cluster Reports for That Tower
[Figure: reports for the tower grouped into clusters numbered 1 to 9.]
Can also sub-divide each cluster into signal strength ranges
Multiple scales of clustering can also be used to trade off geographic versus temporal resolution
© 2017 MapR Technologies 70
Example
[Figure: three panels, each with an axis labeled dt.]
Each cluster gives us a
sequence of events.
Individual anomaly scores can
be scaled and added to get
composite anomaly score
Optimality of combined signal
derives from optimality of
components.
© 2017 MapR Technologies 71
Characterizing Distributions
• What about sequences of values from arbitrary distributions
– Can we find changes in the distribution?
– For instance, what about latencies?
• Non-linear histogram - FloatHistogram
• Fully Adaptive histogram – t-digest
© 2017 MapR Technologies 73
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
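The last bullet can be illustrated with IEEE 754 doubles: for positive values the raw bit pattern increases with the value, so the exponent plus the leading mantissa bits already form a bin number. This is a sketch of the idea only, not the FloatHistogram source; `bin_index`, `min_value` and `bits_per_octave` are names chosen here.

```python
import struct

def double_bits(x):
    """Raw 64-bit pattern of a double, as an integer."""
    return struct.unpack(">q", struct.pack(">d", x))[0]

def bin_index(x, min_value, bits_per_octave=3):
    """Bin number of x, counting from the bin that holds min_value.

    Each power-of-2 range is split evenly into 2**bits_per_octave bins.
    """
    if min_value <= 0 or x < min_value:
        raise ValueError("x must be >= min_value > 0")
    shift = 52 - bits_per_octave          # a double has 52 mantissa bits
    return (double_bits(x) >> shift) - (double_bits(min_value) >> shift)

# With 3 bits per octave, [1, 2) is split into 8 bins of width 0.125.
examples = [bin_index(v, 1.0) for v in (1.0, 1.124, 1.125, 1.5, 2.0, 1024.0)]
```

With 8 bins per power of 2, every bin is at most one eighth as wide as its lower edge, which is where the bound on relative error comes from.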
© 2017 MapR Technologies 76
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
[Figure: scale function k (0 to 10) plotted against quantile q (0.0 to 1.0).]
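The merge rule can be illustrated with a much simplified, batch-only digest. This sketch sorts all samples at once and merges neighbours greedily while the centroid's k-size stays at or below 1; the real t-digest library works incrementally and offers several scale functions. The scale function k(q) = δ/(2π)·asin(2q − 1), the compression value and the function names are assumptions made here.

```python
import math

def k_scale(q, compression):
    """Scale function: steep near q = 0 and q = 1, so centroids there stay small."""
    return compression / (2 * math.pi) * math.asin(2 * q - 1)

def build_digest(samples, compression=100.0):
    """Return a list of (mean, weight) centroids in increasing order of mean."""
    xs = sorted(samples)
    n = len(xs)
    centroids = []
    cur_sum, cur_w = xs[0], 1
    done = 0                                   # weight of all finished centroids
    k_left = k_scale(0.0, compression)
    for x in xs[1:]:
        q_right = (done + cur_w + 1) / n
        if k_scale(q_right, compression) - k_left <= 1.0:
            cur_sum += x                       # k-size still <= 1: merge
            cur_w += 1
        else:
            centroids.append((cur_sum / cur_w, cur_w))
            done += cur_w
            k_left = k_scale(done / n, compression)
            cur_sum, cur_w = x, 1
    centroids.append((cur_sum / cur_w, cur_w))
    return centroids

def quantile(centroids, q):
    """Interpolate between centroid means, placing each mean at its mid-rank."""
    target = q * sum(w for _, w in centroids)
    seen = 0.0
    prev_mean = prev_mid = None
    for mean, w in centroids:
        mid = seen + w / 2.0
        if target <= mid:
            if prev_mean is None:
                return mean
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mean, prev_mid = mean, mid
        seen += w
    return centroids[-1][0]

digest = build_digest([float(v) for v in range(10000)])
median = quantile(digest, 0.5)
p999 = quantile(digest, 0.999)
```

Ten thousand samples collapse to roughly a hundred centroids, with the smallest ones at the two ends where tail percentiles such as the 99.97%-ile are read off.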
© 2017 MapR Technologies 77
Finding Change with Histograms
• With fixed bins, we can simply count and compare counts for
different bins
• Thus, histogram change reduces to count change
• Or to changes in event times
© 2017 MapR Technologies 78
Visualizing Histograms
• We want to detect small changes
– Consider log-scale for Y
• Non-linear bin spacing is really good for increasing counts
– Reweight by bin-width
– Changing x axis changes y axis
© 2017 MapR Technologies 79
Good Results
© 2017 MapR Technologies 80
Bad Results
© 2017 MapR Technologies 81
Bad Results
© 2017 MapR Technologies 82
With Better Scaling
© 2017 MapR Technologies 83
Bad Results
© 2017 MapR Technologies 84
© 2017 MapR Technologies 85
With FloatHistogram
© 2017 MapR Technologies 86
Summary
• Counts – LLR
• Events – Poisson + nth-order diffs
• Decimate in space
• Decimate in measurement space
– t-digest, FloatHistogram
• Don’t forget visualization
© 2017 MapR Technologies 87
Q & A

 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 

Recently uploaded (20)

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 

Finding Changes in Real Data

  • 1. Detecting Change. © 2017 MapR Technologies
  • 2. Contact Information
      Ted Dunning, PhD
      Chief Application Architect, MapR Technologies
      Board member, Apache Software Foundation
      O’Reilly author
      Email: tdunning@mapr.com, tdunning@apache.org
      Twitter: @ted_dunning
  • 3. Who We Are
      • MapR Technologies
          – We make a kick-ass platform for big data computing
          – Support many workloads including Hadoop / Spark / HPC / Other
          – Extended to allow streams and tables in basic platform
          – Free for academic research / training
      • Apache Software Foundation
          – Culture hub for building open source communities
          – Shared values around openness for contribution as well as use
          – Many major projects are part of Apache
          – Even more minor ones!
  • 4. Basic Outline
      • Goal Setting
      • Basic Ideas
          – LLR (finding changes in counts)
          – Poisson rate change detection (finding changes in event timing)
          – Distribution estimation / visualization
          – Labeled events and adding labels
      • Free Improvisation on Themes
  • 5. Why Is This Practically Important
      • The novice came to the master and said “something is broken”
  • 6. Why Is This Practically Important
      • The novice came to the master and said “something is broken”
      • The master replied “What has changed?”
  • 7. Why Is This Practically Important
      • The novice came to the master and said “something is broken”
      • The master replied “What has changed?”
      • And the student was enlightened
  • 8. The Second Student
      • Another student said to the master, “I see something has changed … something may have broken”
  • 9. The Second Student
      • Another student said to the master, “I see something has changed … something may have broken”
      • The master replied, “You have no question to ask. You have no need of enlightenment”
  • 10. The Second Student
      • Another student said to the master, “I see something has changed … something may have broken”
      • The master replied, “You have no question to ask. You have no need of enlightenment”
      • And thus the student was enlightened
  • 11. There are some very powerful techniques available, some of them only recently, that can make the detection of change much easier than you might think. I will describe the practical use of several of these techniques including t-digest, non-linear histograms, variable rate Poisson models and combinations of these.
  • 12. Comparing Counts
      • Suppose we have two situations A and B, each with many observations, nA and nB
      • And some event x occurred n1A times in A and n1B times in B

                  x       other
          A       n1A     nA - n1A
          B       n1B     nB - n1B
  • 13. Comparing Counts
      • Have we seen a change in the frequency of x?
      • Frequency ratios?
          – Breaks with small counts
      • χ² test?
          – Breaks with small counts
  • 14. Log-Likelihood Ratio Test (Root LLR)
      • In R

          entropy = function(k) {
            -sum(k*log((k==0)+(k/sum(k))))
          }
          llr = function(k) {
            (entropy(rowSums(k))+entropy(colSums(k))-entropy(k))*2
          }

      • Like mutual information * 2 N
  • 15. Spot the Anomaly
      • Root LLR is roughly like standard deviations

          B, A    B, not A    not B, A    not B, not A    Root LLR
          13      1000        1000        100,000         0.89
          1       0           0           2               1.95
          1       0           0           10,000          4.51
          10      0           0           100,000         14.29
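The R functions on slide 14 carry over directly to other languages. Below is a minimal Python sketch, assuming a table is passed as a list of rows. The names `entropy`, `llr` and `root_llr` and the signed-root convention (positive when the top-left count exceeds its expected value) are assumptions made for illustration and are not taken from the slides.

```python
import math

def entropy(counts):
    """Unnormalized entropy: -sum(k * log(k / N)), treating 0 * log(0) as 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(table):
    """Log-likelihood ratio statistic for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    cells = [k for row in table for k in row]
    return 2.0 * (entropy(row_sums) + entropy(col_sums) - entropy(cells))

def root_llr(table):
    """Signed square root of LLR for a 2x2 table (sign convention is an assumption)."""
    (k11, k12), (k21, k22) = table
    score = math.sqrt(max(llr(table), 0.0))  # guard against tiny negative round-off
    expected = (k11 + k12) * (k11 + k21) / (k11 + k12 + k21 + k22)
    return score if k11 >= expected else -score
```

Applied to the four tables on slide 15, `root_llr` should agree with the listed scores to about two decimal places, for example `root_llr([[10, 0], [0, 100000]])` for the last table.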
  • 16. How Does it Work
      Empirical fit to asymptotic distribution is very good
  • 17. How Does it Work?
  • 18. OK
      We can detect changes in counts
  • 19. Real-life Example
      • Query: “Paco de Lucia”
      • Conventional meta-data search results:
          – “hombres de paco” times 400
          – not much else
      • Recommendation based search:
          – Flamenco guitar and dancers
          – Spanish and classical guitar
          – Van Halen doing a classical/flamenco riff
  • 20. Real-life Example
  • 21. Example 2 - Common Point of Compromise
      • Scenario:
          – Merchant 0 is compromised, leaks account data during compromise
          – Fraud committed elsewhere during exploit
          – High background level of fraud
          – Limited detection rate for exploits
      • Goal:
          – Find merchant 0
      • Meta-goal:
          – Screen algorithms for this task without leaking sensitive data
  • 22. Example 2 - Common Point of Compromise
      [Diagram: card data is skimmed at Merchant 0; the skimmed data is exploited at Merchant n]
      • Card data is stolen from Merchant 0
      • That data is used in frauds at other merchants
  • 23. Simulation Setup
      [Figure: compromises and frauds per day; x axis: day, 0 to 100; y axis: count, 0 to 500; the compromise period and the exploit period are marked]
  • 24. Detection Strategy
      • Select histories that precede non-fraud
      • And histories that precede fraud detection
      • Analyze 2x2 cooccurrence of merchant n versus fraud detection
  • 25. [No slide text]
  • 26. What about the real world?
  • 27. LLR score for real data
      [Figure: x axis: Number of Merchants, log scale from 10^0 to 10^6; y axis: Breach Score (LLR), 0 to 80; the highest-scoring merchants are labeled “Really truly bad guys”]
  • 28. What about time?
  • 29. Finding Changes in Timing
      • Suppose our input is events embedded in time
      • Suppose we want to find changes in our input in real-time
      • Waiting and counting is fine if we don’t have to react now
      • We can do much better
  • 30. Poisson Event Rate Change
      • Detection of fallout
          – Time since last is very sensitive for complete failure
      • Detection of change relative to reference
          – Time since n-th most recent
          – LLR with time
      • Have to trade detection speed versus false positive rate and size of change
      • Can run multiple detectors at once
  • 31. Basic idea: Time interval is better than counts
  • 32. Sporadic Events: Finding Normal and Anomalous Patterns
      • Time between events is much more usable than absolute times
      • Counts don’t link as directly to probability models
      • Time interval is log ρ
      • This is a big deal
  • 33. Event Stream (timing)
      • Events of various types arrive at irregular intervals
          – we can assume Poisson distribution
      • The key question is whether frequency has changed relative to expected values
          – This shows up as a change in interval
      • Want alert as soon as possible
  • 34. Converting Event Times to Anomaly
      [Figure: event intervals with thresholds marked at the 99.9%-ile and 99.99%-ile]
  • 35. In the real world, event rates often vary
  • 36. Time Intervals Are Key to Modeling Sporadic Events
      [Figure: x axis: t (days), 0 to 4; y axis: dt (min), 0 to 8]
  • 37. Time Intervals Are Key to Modeling Sporadic Events
      [Figure: x axis: t (days), 0 to 4; y axis: dt (min), 0 to 8]
  • 38. Poisson Distribution
      • Time between events is exponentially distributed
      • This means that long delays are exponentially rare
      • If we know λ we can select a good threshold
          – or we can pick a threshold empirically

          $\Delta t \sim \lambda e^{-\lambda t}$
          $P(\Delta t > T) = e^{-\lambda T}$
          $-\log P(\Delta t > T) = \lambda T$
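The relations on slide 38 turn a chosen false-alarm probability into an interval threshold. The sketch below works through that arithmetic; the function names and the idea of passing the rate in events per unit time are assumptions for illustration.

```python
import math

def interval_threshold(rate, false_alarm_prob):
    """Gap length T such that P(dt > T) = false_alarm_prob for exponential gaps."""
    # From P(dt > T) = exp(-rate * T), solve for T.
    return -math.log(false_alarm_prob) / rate

def surprise(rate, dt):
    """Anomaly score -log P(gap > dt), which for exponential gaps is rate * dt."""
    return rate * dt
```

With a known rate, an alert on `dt > interval_threshold(rate, 0.001)` is the same test as `surprise(rate, dt) > -math.log(0.001)`, which is why a rate-scaled interval can be compared against one fixed threshold.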
  • 39. After Rate Correction
      [Figure: x axis: t (days), 0 to 4; y axis: dt/rate, 0 to 10; thresholds marked at the 99.9%-ile and 99.99%-ile]
  • 40. Detecting Anomalies in Sporadic Events
      [Diagram: incoming event times t_i pass through an n-th order difference (Δn); a rate predictor, fed by rate history, supplies λ; the scaled interval δ = λ(t_i - t_{i-n}) feeds a t-digest that tracks the 99.97%-ile threshold t; an alarm is raised when δ > t]
  • 41. Detecting Anomalies in Sporadic Events
      [Diagram: same detector as on slide 40]
  • 42. Seasonality Poses a Challenge
      [Figure: Christmas Traffic; x axis: Date, Nov 17 to Dec 27; y axis: Hits/1000, 0 to 8]
  • 43. Something more is needed …
      [Figure: Christmas Traffic; x axis: Date, Nov 17 to Dec 27; y axis: Hits/1000, 0 to 8]
  • 44. We need a better rate predictor…
      [Diagram: same detector as on slide 40]
  • 45. Idea: Predict log(rate) from lagged log(rate)
      • Predict log because
          – Peak to valley ratio
          – Traffic grew by 30 %
          – All rates are positive
  • 46. Idea: Predict log(rate) from lagged log(rate)
      • Predict log because
          – Peak to valley ratio
          – Traffic grew by 30 %
          – All rates are positive
          – Just because I said so
  • 47. Idea: Predict log(rate) from lagged log(rate)
      • Predict log because
          – Peak to valley ratio
          – Traffic grew by 30 %
          – All rates are positive
          – Just because I said so
      • Let model see many lagged values
      • Use L1 regularized linear model to pick important historical values
          – We would have moved to something fancier if this hadn’t worked
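One way to read slide 47 is as a lasso regression of log(rate) on its own lagged values. The sketch below is a small pure-Python coordinate-descent version written for illustration only: the function names, the absence of an intercept and the fixed number of sweeps are all assumptions, and the slides do not say which solver or library was actually used.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; this is what drives L1 weights to exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lagged_rows(log_rates, lags):
    """Pair each log-rate with the earlier log-rates at the given lags."""
    start = max(lags)
    features = [[log_rates[t - lag] for lag in lags]
                for t in range(start, len(log_rates))]
    targets = log_rates[start:]
    return features, targets

def lasso_fit(features, targets, penalty, sweeps=200):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + penalty * ||w||_1, no intercept."""
    n, p = len(features), len(features[0])
    weights = [0.0] * p
    residual = list(targets)  # equals y - Xw while w is all zeros
    for _ in range(sweeps):
        for j in range(p):
            column = [row[j] for row in features]
            norm = sum(x * x for x in column) / n
            if norm == 0.0:
                continue
            # Correlation with the residual, adding back feature j's own contribution.
            rho = sum(x * r for x, r in zip(column, residual)) / n + norm * weights[j]
            updated = soft_threshold(rho, penalty) / norm
            delta = updated - weights[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, column)]
                weights[j] = updated
    return weights
```

A typical use would build `features, targets = lagged_rows(log_rates, lags)` with lags covering the previous hours and the same hour on previous days, then inspect which entries of `lasso_fit(features, targets, penalty)` are non-zero. Larger penalties leave fewer lags in the model.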
  • 48. A New Rate Predictor for Sporadic Events
  • 49. Improved Prediction with Adaptive Modeling
      [Figure: Christmas Prediction; x axis: Date, Dec 17 to Dec 29; y axis: Hits (x1000), 0 to 8]
  • 50. Some days the magic works
      Some days ... We use slightly different magic
  • 51. Detecting More Subtle Changes
      • Time-since-last finds complete failures well
      • Nth order time finds more subtle rate changes
      • But that subtlety delays detection of complete failure
          – First order delay has 99.9% confidence at 6.5 units
          – 10th order delay has 99.9% confidence at 12.5 units
      • But 10th order delay can find speedups, first order cannot
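For a constant-rate Poisson stream, the rate-scaled span of n consecutive gaps follows an Erlang (gamma) distribution, which is one way to set thresholds for an n-th order detector. The sketch below assumes that model; the function names and the bisection search are illustrative choices, not something taken from the slides.

```python
import math

def erlang_tail(n, x):
    """P(sum of n unit-rate exponential gaps > x) = exp(-x) * sum_{k<n} x^k / k!."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return math.exp(-x) * total

def quantile_from_tail(n, tail_prob, hi=1000.0):
    """Bisection for the x at which erlang_tail(n, x) equals tail_prob."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if erlang_tail(n, mid) > tail_prob:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0

def nth_order_score(times, n, rate):
    """Rate-scaled span of the last n gaps: rate * (t_i - t_{i-n})."""
    return rate * (times[-1] - times[-1 - n])
```

A slowdown alarm compares `nth_order_score` against `quantile_from_tail(n, 0.001)`. A speedup alarm uses the lower end, `quantile_from_tail(n, 0.999)`, which is only useful for larger n because a single gap can never be surprisingly short in the same way.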
  • 52. 10th order difference of Poisson distribution
  • 53. Finding Changes in Time Series
      • So far, we only have times
      • What about when we have times and measurements together?
          – These are called time-series!
      • First step can be to discretize the measurement
          – Quintiles or deciles are good candidates
          – Multi-scale discretization is a fine thing to do
      • That gives us arrival times for measurements in each bin
          – And this is susceptible to the rate model on previous slides
  • 54. Finding Changes in Time Series
      • Comprehensive approaches also possible (for counts)
      • Time aware variant of G-test is possible
      Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993) http://bit.ly/surprise-and-coincidence
  • 55. Propagation Anomalies
      • What happens when something shadows part of the coverage field for mobile telecom?
          – Can happen in urban areas with a construction crane
      • Can solve heuristically
          – Subtract from reference image composed by long term averages
          – Doesn’t deal well with weak signal regions and low S/N
      • Can solve probabilistically
          – Compute anomaly for each measurement, use mean of log(p)
  • 56. [No slide text]
  • 57. [No slide text]
  • 58. Variable Signal/Noise Makes Heuristic Tricky
      Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
  • 59. Other Issues
      • Finding changes in coverage area is similarly tricky
      • Coverage area is roughly where tower signal strength is higher than neighbors
      • Except for fuzziness due to hand-off delays
      • Except for bias due to large-scale caller motions
          – Rush hour
          – Event mobs
  • 60. Simple Answer for Propagation Anomalies
      • Cluster signal strength reports
      • Cluster locations using k-means, large k
      • Model report rate anomaly using discrete event models
      • Model signal strength anomaly using percentile model
      • Trade larger k against higher report rates, faster detection
      • Overall anomaly is sum of individual log(p) anomalies
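The last bullet on slide 60 combines per-cluster anomalies by adding log-probabilities. A minimal sketch of that combination follows; the function name, the sign convention (larger means more anomalous) and the floor that guards against log(0) are assumptions for illustration.

```python
import math

def composite_anomaly(p_values, floor=1e-300):
    """Sum of -log(p) over individual detectors; larger means more surprising."""
    # The floor keeps a reported probability of exactly 0 from raising a math error.
    return sum(-math.log(max(p, floor)) for p in p_values)
```

Each p would come from one of the individual models, for example a tail probability for a cluster's report interval and another for its signal strength percentile. Dividing the result by `len(p_values)` gives the mean of log(p) variant mentioned on slide 55.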
  • 61. Tower Coverage Areas
  • 62. Just One Tower
  • 63. Cluster Reports for That Tower
  • 64. Cluster Reports for That Tower
      [Figure: reports for the tower grouped into clusters numbered 1 to 9]
      Can also sub-divide each cluster into signal strength ranges
      Multiple scales of clustering can also be used to trade off geographic versus temporal resolution
  • 65. Example
      [Figure: three panels of dt for individual clusters, with y ranges 0 to 1.5, 0 to 7 and 0 to 0.6]
      Each cluster gives us a sequence of events.
      Individual anomaly scores can be scaled and added to get composite anomaly score
      Optimality of combined signal derives from optimality of components.
  • 66. Characterizing Distributions
      • What about sequences of values from arbitrary distributions
          – Can we find changes in the distribution?
          – For instance, what about latencies?
      • Non-linear histogram - FloatHistogram
      • Fully Adaptive histogram – t-digest
  • 67. FloatHistogram
      • Assume all measurements are in the range
      • Divide this range into power of 2 sub-ranges
      • Sub-divide each sub-range evenly with steps
      • Relative error is bounded in measurement space
  • 68. FloatHistogram
      • Assume all measurements are in the range
      • Divide this range into power of 2 sub-ranges
      • Sub-divide each sub-range evenly with steps
      • Relative error is bounded in measurement space
      • Bin index can be computed using FP representation!
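The last bullet on slide 68 can be illustrated with IEEE 754 doubles: the exponent field already identifies the power-of-2 sub-range, and the top mantissa bits identify the even step inside it. The sketch below is an illustration of that idea only. It is not the t-digest library's `FloatHistogram` code, and the function names and the `bits_per_octave` parameter are hypothetical.

```python
import struct

def float_bits(x):
    """Raw IEEE 754 double bits of x as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def bin_index(x, x_min, bits_per_octave=3):
    """Bin number for x >= x_min > 0 with 2**bits_per_octave even steps per power of 2."""
    shift = 52 - bits_per_octave  # keep the exponent plus the top mantissa bits
    return (float_bits(x) >> shift) - (float_bits(x_min) >> shift)

def bin_lower_edge(index, x_min, bits_per_octave=3):
    """Smallest value that falls into the given bin."""
    shift = 52 - bits_per_octave
    bits = (index + (float_bits(x_min) >> shift)) << shift
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

With `x_min = 1.0` and 3 bits per octave, the values 1.0, 1.125 and 2.0 land in bins 0, 1 and 8. Every bin is at most 1/8 of its lower edge wide, which is the bounded relative error the slide refers to.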
  • 69. T-digest
      • Or we can talk about small errors in q
      • Accumulate samples, sort, merge
      • Merge if k-size < 1
  • 70. T-digest
      • Or we can talk about small errors in q
      • Accumulate samples, sort, merge
      • Merge if k-size < 1
      [Figure: scale function k versus q; x axis: q, 0 to 1; y axis: k, 0 to 10]
  • 71. T-digest
      • Or we can talk about small errors in q
      • Accumulate samples, sort, merge
      • Merge if k-size < 1
      • Interpolate using centroids in x
      • Very good near extremes, no dynamic allocation
      [Figure: scale function k versus q; x axis: q, 0 to 1; y axis: k, 0 to 10]
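The "accumulate, sort, merge" bullets can be sketched as a single batch pass. The code below is a simplified illustration, not the t-digest library's implementation: it assumes the arcsine scale function k(q) = δ(asin(2q - 1)/π + 1/2) with δ = 10, which matches the 0 to 10 range of the k versus q plot, and it leaves out buffering, repeated merges and interpolation. All names are hypothetical.

```python
import math

def k_scale(q, compression=10.0):
    """Map quantile q in [0, 1] to k in [0, compression]; steepest near q = 0 and q = 1."""
    return compression * (math.asin(2.0 * q - 1.0) / math.pi + 0.5)

def merge_samples(samples, compression=10.0):
    """One sort-and-merge pass returning (mean, weight) centroids in increasing order."""
    points = sorted(samples)
    total = len(points)
    centroids = []
    mean, weight = points[0], 1
    weight_so_far = 0  # total weight of centroids already emitted
    for x in points[1:]:
        q_left = weight_so_far / total
        q_right = (weight_so_far + weight + 1) / total
        if k_scale(q_right, compression) - k_scale(q_left, compression) < 1.0:
            # Absorbing x keeps this centroid's k-size below 1, so merge it.
            weight += 1
            mean += (x - mean) / weight
        else:
            centroids.append((mean, weight))
            weight_so_far += weight
            mean, weight = x, 1
    centroids.append((mean, weight))
    return centroids
```

Because k changes fastest near q = 0 and q = 1, centroids at the extremes stay small while those near the median absorb many samples, which is where the accuracy near the extremes comes from.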
  • 72. Finding Change with Histograms
      • With fixed bins, we can simply count and compare counts for different bins
      • Thus, histogram change reduces to count change
      • Or to changes in event times
  • 73. Visualizing Histograms
      • We want to detect small changes
          – Consider log-scale for Y
      • Non-linear bin spacing is really good for increasing counts
          – Reweight by bin-width
          – Changing x axis changes y axis
  • 74. Good Results
  • 75. Bad Results
  • 76. Bad Results
  • 77. With Better Scaling
  • 78. Bad Results
  • 79. [No slide text]
  • 80. With FloatHistogram
  • 81. Summary
      • Counts – LLR
      • Events – Poisson + nth-order diffs
      • Decimate in space
      • Decimate in measurement space – t-digest, FloatHistogram
      • Don’t forget visualization
      [Figures: the anomaly detector diagram from slide 40 and the k versus q plot from slide 70]
  • 82. Q & A
  • 83. Contact Information
      Ted Dunning, PhD
      Chief Application Architect, MapR Technologies
      Board member, Apache Software Foundation
      O’Reilly author
      Email: tdunning@mapr.com, tdunning@apache.org
      Twitter: @ted_dunning

Editor's Notes

  1. Talk track: This is what it looks like to have events, such as those on a website, that come in at randomized times (people come when they want to) while the underlying average rate is constant; in other words, a fairly steady stream of traffic. This looks a lot like the first signal we talked about: a randomized but even signal. We can use t-digest on it to set thresholds, and everything works just grand. (Like the clicks of a Geiger counter measuring radioactivity.)
  2. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if it is set low during the day, we have an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modelled rate by the interval, which gives a number we can threshold accurately.
  3. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if it is set low during the day, we have an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modelled rate by the interval, which gives a number we can threshold accurately.
  4. Talk track: This slide is here for reference when you download the slides
  5. Ted: this was figure 5-2 in the book
  6. Talk track: You need a rate predictor Ellen: sometimes simple is good enough
  7. Ted: This was figure 5.4
  8. Ted: This was figure 5.4
  9. Ted: this was figure 5-2 in the book
  10. We can look at yesterday and day before but need to look at the shape from previous days … but look at today for whether traffic is scaling
  11. Ted: This was figure 5.4