Piotr Guzik discusses developing an anomaly-detection model for clickstream data. The goal is to detect quickly whether data is lost or abnormal. An initial statistical model is created in R but has issues. The model is then rewritten from scratch in Scala to be simpler, time-aware, adaptive to trends, resistant to false positives, and based on confidence intervals. The output is a signed probability of an anomaly: smaller anomalies are smoothed out, larger ones flagged. Anomalies that last a long time are eventually treated as the new normal. The system is deployed as SaaS, with multiple clients sharing the same codebase but using different configurations to react to anomalies.
3. Why is anomaly detection interesting?
Anomaly detection on clickstream is all about:
● SLA (data should be a first-class citizen)
● You should be the first to know if something is wrong
“Engineers in the 20th century made mistakes, but never by more than
one order of magnitude. In IT we do not do that well.”
4. Motivation and goals
Goal: get quick information when data is lost
● Losing data is much like losing money
● “You cannot improve what you cannot measure”
● The team responsible for a given service should be alerted when something is wrong
5. How to start? Important questions
● How do we get the data?
● Do we need real-time detection?
● What delay is acceptable?
● What is an anomaly?
6. Discovering the data source and the data itself
● Data source: Druid
● OLAP cube dimensions as domains
● Data aggregated every 15 minutes
● Metric: the simplest possible count
7. What is the core data?
Data ~= the result of this query:
select count(*) as cnt, category, action, time_window_15_m
from page_views
where category = 'Search' and action = 'ShowItem'
group by category, action, time_window_15_m
9. Knowing the data
● Clickstream is periodic
● Week == period
● Days of the week differ a lot
● Web traffic increases rapidly at about 6 PM and starts to fall at about 10 PM
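The weekly periodicity suggests a time-aware baseline: compare the current 15-minute window against the same weekday/time slot in previous weeks. A minimal sketch of that idea (object and method names are illustrative, not from the talk):

```scala
// Minimal sketch: estimate the expected count for a 15-minute window
// from the same weekday/time slot in all previous weeks.
object WeeklyBaseline {
  val WindowsPerWeek = 7 * 24 * 4 // 672 fifteen-minute windows in a week

  // history: counts for consecutive 15-minute windows, oldest first
  def expectedAt(history: IndexedSeq[Double], windowIndex: Int): Option[Double] = {
    // walk back one week at a time, collecting counts for the same slot
    val past = Iterator.iterate(windowIndex - WindowsPerWeek)(_ - WindowsPerWeek)
      .takeWhile(_ >= 0)
      .map(history)
      .toSeq
    if (past.isEmpty) None else Some(past.sum / past.size)
  }
}
```

Returning `None` when less than a week of history exists makes the cold-start case explicit rather than guessing a baseline.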
10. Research
Motto: the solution must be simple, and not only for data scientists.
Available solutions:
● Twitter's library - too hard: heavy math, many hyperparameters
● HTM algorithms - way too hard: neural networks, deep learning, very hard to reason about the algorithm and its results
Conclusion: we have to create our own simple model
11. What should our model look like?
The perfect model:
● Simple
● Time-aware
● Detects anomalies in minutes rather than hours
● Adapts to trends (ads, currently popular items)
● Does not report too many false positives
● Uses confidence intervals
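A minimal sketch of the confidence-interval idea, assuming a history of counts for the same weekday/time slot; the z-threshold and names are illustrative choices, not the exact formula from the talk:

```scala
// Minimal sketch: flag the current 15-minute count as anomalous when it
// falls outside a confidence interval built from historical counts.
object ConfidenceCheck {
  def meanAndSd(xs: Seq[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  // z = 3.0 roughly corresponds to a 99.7% interval under a normality assumption
  def isAnomaly(history: Seq[Double], current: Double, z: Double = 3.0): Boolean = {
    val (mean, sd) = meanAndSd(history)
    math.abs(current - mean) > z * sd
  }
}
```

A wider interval (larger `z`) trades detection speed for fewer false positives, which is exactly the tension the slide lists.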
14. F.A.I.L. - first attempt in learning
A simple statistical model in R.
First results:
● A rapid change in the metric is a problem
● Trend is important, but must not lead to overfitting
15. Experiments in progress
After the model evolved:
● Outliers are problematic (they inflate the standard deviation)
● Outliers == duplicates of data on HDFS (thank you, Camus!)
● Percentiles are great for outlier removal
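Percentile-based trimming can be sketched as dropping the extreme ranks before estimating the baseline. The 5% fraction and names below are illustrative assumptions, not the exact values from the talk:

```scala
// Minimal sketch: remove outliers by trimming the lowest and highest
// `fraction` of values by rank (roughly keeping the [p5, p95] range
// for fraction = 0.05) before computing mean and standard deviation.
object PercentileTrim {
  def trim(xs: Seq[Double], fraction: Double = 0.05): Seq[Double] = {
    val k = (xs.size * fraction).toInt // number of values to drop at each end
    xs.sorted.slice(k, xs.size - k)
  }
}
```

This makes the baseline robust to a few duplicated (doubled) counts without needing to model the duplication itself.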
16. Problems with R
● Only the data scientist knows R
● There is no easy way to deploy it
● You cannot monitor it easily
● It is hard to maintain
Decision: we have to rewrite it. From scratch. In Scala.
23. Anomaly detection - did we miss something?
● A long-lasting anomaly is not an anomaly anymore
● Loss of data is crucial
● The output should be easy to understand
24. Long-lasting anomalies - key concepts
Output: a signed probability of an anomaly
● Small anomalies should be smoothed out, while larger ones should be amplified (for monitoring and alerting)
● We define where obvious anomalies start
● We define how long an anomaly must last before we treat it as the norm (be careful here)
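The signed-score idea can be sketched by mapping the deviation from the expected value into [-1, 1], so small deviations are squashed toward 0 and anything past the "obvious anomaly" threshold saturates. The quadratic scaling and the `obviousAt = 4.0` default are illustrative assumptions, not the talk's actual formula:

```scala
// Minimal sketch of a signed anomaly score in [-1, 1]:
// small deviations are smoothed toward 0, large ones saturate at +/-1.
object AnomalyScore {
  // obviousAt: deviation (in standard deviations) from which an anomaly
  // is considered obvious; the score reaches +/-1 there.
  def score(expected: Double, sd: Double, observed: Double,
            obviousAt: Double = 4.0): Double = {
    val z = (observed - expected) / sd
    val magnitude = math.min(1.0, math.pow(math.abs(z) / obviousAt, 2))
    math.signum(z) * magnitude
  }
}
```

The sign says whether traffic is too low or too high, which matters because a drop usually means data loss while a spike may just be a popular item.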
25. Long-lasting anomalies - the fix
In case of a long-lasting anomaly, we rescale all model parameters,
as if the model had been wrong from the beginning
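One way to read "multiply all model params" is rescaling the baseline by the observed/expected ratio once the anomaly outlasts the configured threshold. The model shape and names below are assumptions for illustration:

```scala
// Minimal sketch of the "new normal" fix: once an anomaly has lasted
// longer than the configured threshold, rescale the model as if its
// baseline had been wrong from the start.
case class BaselineModel(expected: Double, sd: Double)

object LongAnomalyFix {
  def adoptNewNormal(model: BaselineModel, observedLevel: Double): BaselineModel = {
    val ratio = observedLevel / model.expected
    // Scale level and spread together so the relative width of the
    // confidence interval is preserved at the new traffic level.
    BaselineModel(model.expected * ratio, model.sd * ratio)
  }
}
```

After the rescale, the sustained level stops being flagged, which is the "be careful here" part: a real permanent data loss would also be adopted as normal.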
28. Whole team - thank you
More than just me and my team were involved in this process.
Big thanks to:
● My team, for motivation and heated discussions :)
● Paweł Zawistowski, for the initial model in R
● Other teams, for real use cases (that is why you want to be in production quickly)