Anomaly Detection made easy
Piotr Guzik
$whoami
● Data Engineer @Allegro (Scala, Kafka,
Spark, Ansible, ML)
● Trainer @GetInData
● https://twitter.com/guzik_io
● Data Science flavour
Why is anomaly detection interesting?
Anomaly detection on clickstream is all about:
● SLA (data should be a first-class citizen)
● You should be the first to know if something is wrong
“Engineers in the 20th century made mistakes, but never by more than
one order of magnitude. In IT it is not that good.”
Motivation and goals
Goal: find out quickly when data is lost
● Losing data is much like losing money
● “You cannot improve what you cannot measure”
● The team responsible for a given service should be alerted when
something is wrong
How to start? Important questions
● How to get the data?
● Real-time detection?
● Delay?
● What is an anomaly?
Discovering the data source and the data itself
● Data source: Druid
● OLAP cube dimensions as domains
● Data aggregated every 15 minutes
● Metric: a simple count
What is the core data?
Data ~= result of the query:
  SELECT count(*) AS cnt, category, action, time_window_15_m
  FROM page_views
  WHERE category = 'Search' AND action = 'ShowItem'
  GROUP BY category, action, time_window_15_m
First look at the data
Knowing the data
● Clickstream is periodic
● Period == one week
● Days of the week differ a lot
● Web traffic rises rapidly at about 6 PM and starts to fall at
about 10 PM
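Weekly periodicity with 15-minute aggregation means each observation can be compared with earlier observations from the same slot of the week. A minimal sketch in Python (the talk's own code is not shown in the slides); the name week_slot and the choice of Monday 00:00 as slot 0 are illustrative assumptions.

```python
from datetime import datetime

WINDOW_MINUTES = 15
SLOTS_PER_DAY = 24 * 60 // WINDOW_MINUTES  # 96 windows per day
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY         # 672 windows per week


def week_slot(ts: datetime) -> int:
    """Index of the 15-minute window within the week (Monday 00:00 is slot 0)."""
    minutes_into_day = ts.hour * 60 + ts.minute
    return ts.weekday() * SLOTS_PER_DAY + minutes_into_day // WINDOW_MINUTES


# Two Mondays at about 6 PM land in the same slot, so their counts are comparable.
same_slot = week_slot(datetime(2024, 1, 1, 18, 0)) == week_slot(datetime(2024, 1, 8, 18, 7))
```

Grouping history by this index keeps a Monday evening from being judged against a Sunday morning.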
Research
Motto: the solution must be easy, and not only for data scientists.
Available solutions:
● Twitter library - too hard, heavy math, many hyperparameters
● HTM algorithms - way too hard, neural networks, deep
learning, very hard to reason about the algorithm and its results
Conclusion: we have to create our own simple model
What should our model be like?
Perfect model:
● Simple
● Time aware
● Detects in minutes rather than hours
● Adapts to trends (ads, currently popular items)
● Does not report too many false positives
● Uses confidence intervals
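One simple way to be both time aware and based on confidence intervals is to keep past counts for one 15-minute slot of the week and flag values that fall outside mean ± z · sd. The sketch below is in Python and is not the model from the talk; the function names and the default z = 3 are illustrative assumptions.

```python
from statistics import mean, stdev


def confidence_interval(history, z=3.0):
    """Return (low, high) bounds around the mean of past counts for one slot."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation, needs at least 2 values
    return center - z * spread, center + z * spread


def is_anomaly(value, history, z=3.0):
    """True when the new count falls outside the interval built from history."""
    low, high = confidence_interval(history, z)
    return value < low or value > high


# Hypothetical counts seen in the same slot over the previous five weeks.
history = [100, 110, 90, 105, 95]
dropped = is_anomaly(40, history)   # a sharp drop, e.g. lost data
normal = is_anomaly(102, history)   # within the usual range
```

A wider z reports fewer false positives at the cost of missing smaller deviations.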
Best tool for inventing an algorithm
Model draft
F.A.I.L. - first attempt in learning
Simple statistical model in R
First results:
● Rapid changes in the metric are a problem
● Trend is important but must not lead to overfitting
Experiments in progress
After model evolution:
● Outliers are problematic (they distort the standard deviation)
● Outliers == duplicated data on HDFS (thank you, Camus!)
● Percentiles are great for outlier removal
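Percentile-based trimming can be sketched as follows: values outside a chosen percentile band are dropped before any mean or standard deviation is computed. This is a Python illustration, not the code from the talk; the nearest-rank percentile definition and the 10th/90th percentile band are assumptions chosen for the example.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list; p is an integer in 1..100."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def drop_outliers(values, low_p=10, high_p=90):
    """Keep only values between the low_p-th and high_p-th percentiles (inclusive)."""
    ordered = sorted(values)
    low = percentile(ordered, low_p)
    high = percentile(ordered, high_p)
    return [v for v in values if low <= v <= high]


# Hypothetical counts for one slot; 200 imitates a window with duplicated data.
counts = [100, 98, 102, 101, 99, 97, 103, 100, 101, 200]
cleaned = drop_outliers(counts)
```

Unlike a rule based on the standard deviation, the percentile band is barely moved by the very outlier it is meant to remove.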
Problems with R
● Only the data scientist knows R
● There is no easy way to deploy it
● You cannot monitor it easily
● It is hard to maintain
Decision: we have to rewrite it. From scratch. In Scala.
Input from Druid
Model
Some math (EMA!)
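The formulas from this slide are not preserved in the transcript. For reference, the standard recursive definition of the exponential moving average (EMA) is given below; the smoothing factor \alpha and the seeding with the first observation are the usual textbook conventions, not necessarily the ones used in the talk.

```latex
\mathrm{EMA}_0 = x_0, \qquad
\mathrm{EMA}_t = \alpha \, x_t + (1 - \alpha) \, \mathrm{EMA}_{t-1}, \qquad 0 < \alpha \le 1
```

A larger \alpha weights recent observations more heavily, so the estimate follows changes faster but is also noisier.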
Trend (fast-changing world)
Learning is a difficult process
What if we learned something that is no longer valid?
The mean could be bad, but what about the EMA?
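The difference between the two can be seen on a series whose level shifts and stays shifted: the plain mean keeps averaging in the stale history, while an exponential moving average forgets it. A small Python sketch with made-up numbers; alpha = 0.5 is an arbitrary choice for the illustration.

```python
def plain_mean(values):
    """Arithmetic mean: every past observation counts equally."""
    return sum(values) / len(values)


def ema(values, alpha):
    """Exponential moving average seeded with the first value."""
    estimate = values[0]
    for x in values[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
    return estimate


# Traffic sits at 100 for 20 windows, then the level moves to 200 and stays there.
series = [100] * 20 + [200] * 5

mean_estimate = plain_mean(series)     # still close to the old level
ema_estimate = ema(series, alpha=0.5)  # already close to the new level
```

On this series the mean is 120 while the EMA is 196.875, so a model built on the mean would keep reporting the new normal as an anomaly for much longer.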
Anomaly detection - almost there?
Anomaly detection - did we miss something?
● A long-lasting anomaly is no longer an anomaly
● Loss of data is crucial
● Output should be easy to understand
Long-lasting anomalies - key concepts
Output: signed probability of an anomaly
● Small anomalies should be smoothed out and larger ones
amplified (for monitoring and alerting)
● We define where obvious anomalies start
● We define how long an anomaly must last before we treat it as
the norm (be careful here)
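The slides do not show how the signed output is computed. One possible shape, sketched in Python under stated assumptions: the deviation is expressed in units of the slot's typical spread, small deviations are damped, and the score reaches 0.95 exactly where "obvious" anomalies are configured to start. The function name, the Gaussian-style curve, and the 0.95 level are all hypothetical choices for illustration.

```python
import math


def signed_anomaly_score(value, expected, spread, obvious_at=3.0):
    """Map a deviation to a score in [-1, 1]; the sign tells drop from spike."""
    z = (value - expected) / spread
    # Quadratic near zero (small anomalies are smoothed out), saturating for
    # large |z|; log(20) makes the magnitude 0.95 when |z| == obvious_at.
    exponent = (z / obvious_at) ** 2 * math.log(20)
    magnitude = 1.0 - math.exp(-exponent)
    return math.copysign(magnitude, z)


# Hypothetical slot with an expected count of 100 and a typical spread of 10.
small = signed_anomaly_score(105, expected=100, spread=10)      # barely registers
obvious = signed_anomaly_score(70, expected=100, spread=10)     # clear drop
lost_data = signed_anomaly_score(0, expected=100, spread=10)    # data is gone
```

A negative score signals a drop, which is the case that matters most for detecting lost data, and alert thresholds can be set on one easily explained number.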
Long-lasting anomalies - fix
When an anomaly is long-lasting, we multiply all model parameters,
as if we had been wrong from the beginning
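A sketch of what such a correction could look like in Python. The slide only says that all model parameters are multiplied; the SlotModel fields, the LongAnomalyTracker class, the patience setting, and the idea of using the observed-to-expected ratio as the factor are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SlotModel:
    mean: float    # expected count for one slot of the week
    spread: float  # typical deviation for that slot


class LongAnomalyTracker:
    """Counts consecutive anomalous windows and reports when they become the norm."""

    def __init__(self, patience):
        self.patience = patience  # windows an anomaly must last before rescaling
        self.streak = 0

    def update(self, is_anomalous):
        self.streak = self.streak + 1 if is_anomalous else 0
        return self.streak >= self.patience


def rescale_all(models, factor):
    """Multiply every slot's parameters by the same factor."""
    return [SlotModel(m.mean * factor, m.spread * factor) for m in models]


models = [SlotModel(100.0, 10.0), SlotModel(200.0, 20.0)]
tracker = LongAnomalyTracker(patience=3)
flags = [tracker.update(a) for a in [True, True, False, True, True, True]]
if flags[-1]:
    # Traffic has been at half the expected level for 3 windows in a row.
    models = rescale_all(models, factor=0.5)
```

Setting patience too low teaches the model to accept a real outage as normal, which is presumably why the previous slide says to be careful here.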
Deployment
SaaS model
● Multiple deployments with the same codebase
● Different configurations
● Clients define how they want to react
Configuration example
Whole team - thank you
More people than just me and my team were involved in this process.
Big thanks to:
● My team for motivation and heated discussions :)
● Paweł Zawistowski - initial model in R
● Other teams for real use cases (that is why you want to
be in production quickly)
Thank you
Q & A
Piotr Guzik
