Piotr Guzik discusses developing an anomaly-detection model for clickstream data. The goal is to detect quickly whether data is lost or abnormal. An initial statistical model is created in R but has issues. The model is then rewritten from scratch in Scala to be simpler, time-aware, adaptive to trends, resistant to false positives, and based on confidence intervals. The output is a signed probability of an anomaly: smaller anomalies are smoothed out, larger ones flagged. Anomalies that last a long time are eventually treated as the new normal. The system is deployed as SaaS, with multiple clients sharing the same codebase but using different configurations to react to anomalies.
3. Why is anomaly detection interesting?
Anomaly detection on clickstream is all about:
● SLA (data should be a first-class citizen)
● You should be the first to know if something is wrong
“Engineers in the 20th century made mistakes, but never by more than
one order of magnitude. In IT we do not do that well.”
4. Motivation and goals
Goal: get quick information when data is lost
● Losing data is much like losing money
● “You cannot improve what you cannot measure”
● The team responsible for a given service should be alerted when something is wrong
5. How to start? Important questions
● How do we get the data?
● Do we need real-time detection?
● What delay is acceptable?
● What is an anomaly?
6. Discovering the data source and the data itself
● Data source: Druid
● OLAP cube dimensions as domains
● Data aggregated every 15 minutes
● Metric: the simplest possible count
7. What is the core data?
Data ~= the result of this query:
select count(*) as cnt, category, action, time_window_15_m
from page_views
where category = 'Search' and action = 'ShowItem'
group by category, action, time_window_15_m
9. Knowing the data
● Clickstream is periodic
● Week == period
● Days of the week differ a lot
● Web traffic increases rapidly at about 6 PM and starts to fall at about 10 PM
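The weekly periodicity suggests a time-aware baseline: compare the current 15-minute window against the same weekday/time slot in previous weeks. A minimal sketch of that idea (object and method names are illustrative, not from the talk):

```scala
// Minimal sketch: estimate the expected count for a 15-minute window
// from the same weekday/time slot in all previous weeks.
object WeeklyBaseline {
  val WindowsPerWeek = 7 * 24 * 4 // 672 fifteen-minute windows in a week

  // history: counts for consecutive 15-minute windows, oldest first
  def expectedAt(history: IndexedSeq[Double], windowIndex: Int): Option[Double] = {
    // walk back one week at a time, collecting counts for the same slot
    val past = Iterator.iterate(windowIndex - WindowsPerWeek)(_ - WindowsPerWeek)
      .takeWhile(_ >= 0)
      .map(history)
      .toSeq
    if (past.isEmpty) None else Some(past.sum / past.size)
  }
}
```

Returning `None` when less than a week of history exists makes the cold-start case explicit rather than guessing a baseline.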
10. Research
Motto: the solution must be simple, and not only for data scientists.
Available solutions:
● Twitter's library - too hard: heavy math, many hyperparameters
● HTM algorithms - way too hard: neural networks, deep learning, very hard to reason about the algorithm and its results
Conclusion: we have to create our own simple model
11. What should our model look like?
The perfect model:
● Simple
● Time-aware
● Detects anomalies in minutes rather than hours
● Adapts to trends (ads, currently popular items)
● Does not report too many false positives
● Uses confidence intervals
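A minimal sketch of the confidence-interval idea, assuming a history of counts for the same weekday/time slot; the z-threshold and names are illustrative choices, not the exact formula from the talk:

```scala
// Minimal sketch: flag the current 15-minute count as anomalous when it
// falls outside a confidence interval built from historical counts.
object ConfidenceCheck {
  def meanAndSd(xs: Seq[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  // z = 3.0 roughly corresponds to a 99.7% interval under a normality assumption
  def isAnomaly(history: Seq[Double], current: Double, z: Double = 3.0): Boolean = {
    val (mean, sd) = meanAndSd(history)
    math.abs(current - mean) > z * sd
  }
}
```

A wider interval (larger `z`) trades detection speed for fewer false positives, which is exactly the tension the slide lists.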
14. F.A.I.L. - first attempt in learning
A simple statistical model in R.
First results:
● A rapid change in the metric is a problem
● Trend is important, but must not lead to overfitting
15. Experiments in progress
After the model evolved:
● Outliers are problematic (they inflate the standard deviation)
● Outliers == duplicates of data on HDFS (thank you, Camus!)
● Percentiles are great for outlier removal
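Percentile-based trimming can be sketched as dropping the extreme ranks before estimating the baseline. The 5% fraction and names below are illustrative assumptions, not the exact values from the talk:

```scala
// Minimal sketch: remove outliers by trimming the lowest and highest
// `fraction` of values by rank (roughly keeping the [p5, p95] range
// for fraction = 0.05) before computing mean and standard deviation.
object PercentileTrim {
  def trim(xs: Seq[Double], fraction: Double = 0.05): Seq[Double] = {
    val k = (xs.size * fraction).toInt // number of values to drop at each end
    xs.sorted.slice(k, xs.size - k)
  }
}
```

This makes the baseline robust to a few duplicated (doubled) counts without needing to model the duplication itself.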
16. Problems with R
● Only the data scientist knows R
● There is no easy way to deploy it
● You cannot monitor it easily
● It is hard to maintain
Decision: we have to rewrite it. From scratch. In Scala.
23. Anomaly detection - did we miss something?
● A long-lasting anomaly is not an anomaly anymore
● Loss of data is crucial
● The output should be easy to understand
24. Long-lasting anomalies - key concepts
Output: a signed probability of an anomaly
● Small anomalies should be smoothed out, while larger ones should be amplified (for monitoring and alerting)
● We define where obvious anomalies start
● We define how long an anomaly must last before we treat it as the norm (be careful here)
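The signed-score idea can be sketched by mapping the deviation from the expected value into [-1, 1], so small deviations are squashed toward 0 and anything past the "obvious anomaly" threshold saturates. The quadratic scaling and the `obviousAt = 4.0` default are illustrative assumptions, not the talk's actual formula:

```scala
// Minimal sketch of a signed anomaly score in [-1, 1]:
// small deviations are smoothed toward 0, large ones saturate at +/-1.
object AnomalyScore {
  // obviousAt: deviation (in standard deviations) from which an anomaly
  // is considered obvious; the score reaches +/-1 there.
  def score(expected: Double, sd: Double, observed: Double,
            obviousAt: Double = 4.0): Double = {
    val z = (observed - expected) / sd
    val magnitude = math.min(1.0, math.pow(math.abs(z) / obviousAt, 2))
    math.signum(z) * magnitude
  }
}
```

The sign says whether traffic is too low or too high, which matters because a drop usually means data loss while a spike may just be a popular item.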
25. Long-lasting anomalies - the fix
In case of a long-lasting anomaly, we rescale all model parameters,
as if the model had been wrong from the beginning
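One way to read "multiply all model params" is rescaling the baseline by the observed/expected ratio once the anomaly outlasts the configured threshold. The model shape and names below are assumptions for illustration:

```scala
// Minimal sketch of the "new normal" fix: once an anomaly has lasted
// longer than the configured threshold, rescale the model as if its
// baseline had been wrong from the start.
case class BaselineModel(expected: Double, sd: Double)

object LongAnomalyFix {
  def adoptNewNormal(model: BaselineModel, observedLevel: Double): BaselineModel = {
    val ratio = observedLevel / model.expected
    // Scale level and spread together so the relative width of the
    // confidence interval is preserved at the new traffic level.
    BaselineModel(model.expected * ratio, model.sd * ratio)
  }
}
```

After the rescale, the sustained level stops being flagged, which is the "be careful here" part: a real permanent data loss would also be adopted as normal.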
28. Whole team - thank you
More than just me and my team were involved in this process.
Big thanks to:
● My team, for motivation and heated discussions :)
● Paweł Zawistowski, for the initial model in R
● Other teams, for real use cases (that is why you want to be in production quickly)