Location:
QuantUniversity Meetup
July 11th 2016
Boston MA
Outlier Analysis for Temporal Datasets
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
Slides and Code available at:
http://www.analyticscertificate.com/Anomaly/
Agenda
• 6:30-7:15 – Anomaly Detection Part II
• 7:15-8:00 – Azure ML Example
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers (Shell, Firstfuel Software etc.)
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program later in Fall
(MATLAB version also available)
Events of Interest
• July
▫ 11th: QuantUniversity’s 2nd meetup
▪ Topic: Quantitative methods (topic TBD)
• August
▫ 1st and 2nd: 2-day workshop on Anomaly Detection
▪ Registration and pricing details at www.analyticscertificate.com/Anomaly
▫ 8th: QuantUniversity meetup
▫ 14-20th: ARPM in New York www.arpm.co
▪ QuantUniversity presenting on Model Risk on August 14th
▫ 18-21st: Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html
▫ Use promotional code SPEAKERREF to receive $200 discount on or before July 22nd
QuantUniversity’s Summer workshop series
• July
▫ Anomaly Detection Part II
• August
▫ Anomaly Detection Workshop
▫ Model Evaluation: Metrics, Scaling and Best Practices
• September
▫ What’s missing? Best practices in missing data analysis
What is anomaly detection?
• Anomalies, or outliers, are data points that appear to deviate
markedly from expected outputs.
• Anomaly detection is the process of finding patterns in data that don’t
conform to prior expected behavior.
Examples
• Fraud Detection
• Stock market
• E-commerce
Part 1: Summary
We have covered anomaly detection:
• Introduction
▫ Definition of anomaly detection and its importance in energy systems
▫ Different types of anomaly detection methods: statistical, graphical and machine
learning methods
• Graphical approach
▫ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol
plot, which demonstrate outliers graphically
▫ The main assumption for applying graphical approaches is multivariate normality
▫ The Mahalanobis distance is mainly used for calculating the distance of a point
from the center of a multivariate distribution
• Statistical approach
▫ Statistical hypothesis tests include the Chi-square test and Grubbs’ test
▫ Statistical methods may use either scores or p-values as the threshold to detect outliers
• Machine learning approach
▫ Both supervised and unsupervised learning methods can be used for outlier detection
▫ Piecewise (segmented) regression can be used to identify outliers based on the
residuals for each segment
▫ In the K-means clustering method, outliers are defined as points that do not belong
to any cluster, are far away from the cluster centroids, or form sparse clusters
Anomaly Detection Part II : Dealing with Temporal Data
• In time series datasets, the assumption of temporal continuity plays
an important role in defining and detecting outliers.
• When analyzing a single time series, a lack of temporal continuity
with immediate neighbors signals an outlier. For example:
▫ A significant increase or decrease in value when compared with
immediate neighboring values. Example: Stock charts
• When analyzing multidimensional time series streams, temporal
continuity is much weaker. For example:
▫ Novel outliers that differ from aggregate trends. Example: Novel client
traffic from a new location in Google Analytics
Point anomalies
• Points that are outside of the “normal” points
Contextual anomalies
• Time is a contextual attribute that
determines the position of an instance
in the entire sequence.
• A 145-point drop is not rare,
but it is an anomaly if the drop happens
within a period of 3 minutes.
Ref: http://www.bloomberg.com/news/articles/2013-04-23/fake-report-erasing-136-billion-shows-market-s-fragility?cmpid=yhoo
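One way to make the "same drop, different context" idea concrete is to look for drops of a given size that complete within a given time span. The sketch below is only an illustration: the function name `fast_drops`, the sample levels and the thresholds are hypothetical and not taken from the talk.

```python
def fast_drops(times, values, max_seconds, min_drop):
    """Return (start, end, drop) triples where the value falls by at
    least min_drop within max_seconds. Times must be sorted ascending."""
    hits = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if times[j] - times[i] > max_seconds:
                break  # later points are even further away in time
            drop = values[i] - values[j]
            if drop >= min_drop:
                hits.append((i, j, drop))
    return hits


# Hypothetical index levels sampled once a minute (seconds since start).
times = [0, 60, 120, 180, 240, 300]
levels = [14700.0, 14690.0, 14560.0, 14550.0, 14690.0, 14695.0]

# A 145-point drop is flagged only because it happens within 3 minutes.
print(fast_drops(times, levels, max_seconds=180, min_drop=145.0))
```

The magnitude threshold alone would not separate this event from an ordinary slow decline; the time window supplies the context.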
Nuances in Time series analysis
• Time Series Analysis
▫ Numbers across time
▫ Example: Stock data
• Discrete sequences
▫ Labels across time
▫ Example: Log of client interactions
▫ http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web,
ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp,
smtp-mail, http-web
Collective anomalies
• Here, a collection of related data instances is anomalous with
respect to the entire data set
Ref: http://krebsonsecurity.com/2010/10/pill-gang-used-microsofts-network-to-attack-krebsonsecurity-com/
Challenges
• Defining what is normal and what isn’t
• The notion of normal behavior keeps evolving
• The magnitude of the anomaly may be different
• Labels may not be available
• Noise may manifest as anomalies and it may be difficult to identify
and remove.
Methods for Anomaly Detection
Univariate data
• Point outlier scenario:
▫ Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians, LOESS regression)
▫ Data mining methods (Multilayer perceptron)
• Outlier subsequences scenario:
▫ Window-based method
▫ Distance-based method (PAA, SAX and HOT SAX)
Multivariate data
• Statistical methods:
▫ Cook’s distance
▫ Bonferroni’s test
• Distance-based methods:
▫ Local Outlier Factor (LOF)
• Data mining methods:
▫ Clustering algorithms (Hierarchical and K-Means)
Methods for Anomaly Detection (continued)
Database time series (univariate and multivariate data)
• Density approach for principal components
• Graphical methods:
▫ Bivariate and functional bag plots
▫ Bivariate and functional HDR box plots
• Clustering methods:
▫ Euclidean, correlation, autocorrelation and wavelet transform metrics
Censored survival data
• Statistical methods:
▫ Residual-based algorithm
▫ Scoring algorithm
Single Time Series – Sample approaches
• Point Outliers
▫ Prediction models
▫ Profile Similarity-based approaches and Deviants
• Subsequence Outliers
▫ Discord discovery
Point outliers
• Input: A time series t
• Output: Outlier points in t
Prediction Models: Compute outlier scores as the deviation from the
predicted value
• Median:
▫ Choose a window size k
▫ Compute the median over the window from t-k to t+k
• Mean:
▫ Choose a window size k
▫ Compute the mean over the window from t-k to t+k
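The window-based predictor can be sketched in a few lines of Python. This is a minimal illustration rather than the code distributed with the talk: the helper names, the sample prices and the threshold are hypothetical, and leaving the point itself out of its own window is an assumption made here so that a spike does not mask itself.

```python
from statistics import median


def window_median_scores(series, k):
    """Score each point by its absolute deviation from the median of its
    neighbours in the window [i-k, i+k] (the point itself is excluded,
    which is an assumption of this sketch). Needs at least two points."""
    n = len(series)
    scores = []
    for i, value in enumerate(series):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        neighbours = series[lo:i] + series[i + 1:hi]
        scores.append(abs(value - median(neighbours)))
    return scores


def flag_outliers(series, k, threshold):
    """Return the indices whose score exceeds a user-chosen threshold."""
    scores = window_median_scores(series, k)
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical price series with one spike at index 4.
prices = [10.0, 10.2, 10.1, 10.3, 25.0, 10.2, 10.4, 10.3, 10.5]
print(flag_outliers(prices, k=2, threshold=5.0))  # -> [4]
```

Swapping `median` for `statistics.mean` gives the mean variant; the median is less affected by a neighbouring spike, which is why the points next to index 4 keep low scores here.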
Point outliers : Prediction Models
• ARIMA framework
Point outliers : Prediction Models
• Neural Networks
▫ MLP predictor
[Figure: original data, fitted data and boundaries. Any data points beyond the boundaries are considered outliers.]
Point outliers : Profile Similarity-Based Approach
• Create a normal profile (Example: MLP, AR, etc.) and a notion of
variance
• Estimate the next point
• Compare the realized value with the estimated point.
▫ If within the band, normal
▫ Else, outlier
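The steps above can be sketched with the simplest autoregressive profile, an AR(1) model fitted by least squares. Everything named here (`fit_ar1`, `is_outlier`, the band width of three residual standard deviations and the sample history) is an assumption chosen for illustration; the slide does not prescribe a particular model or band.

```python
from statistics import mean, pstdev


def fit_ar1(history):
    """Least-squares fit of x[t] = c + phi * x[t-1].
    Returns (c, phi, sigma), where sigma is the residual std deviation."""
    x_prev, x_next = history[:-1], history[1:]
    mp, mn = mean(x_prev), mean(x_next)
    var = sum((p - mp) ** 2 for p in x_prev)
    cov = sum((p - mp) * (q - mn) for p, q in zip(x_prev, x_next))
    phi = cov / var  # assumes the history is not constant
    c = mn - phi * mp
    residuals = [q - (c + phi * p) for p, q in zip(x_prev, x_next)]
    return c, phi, pstdev(residuals)


def is_outlier(history, realized, width=3.0):
    """Flag the realized value if it falls outside predicted +/- width*sigma."""
    c, phi, sigma = fit_ar1(history)
    predicted = c + phi * history[-1]
    return abs(realized - predicted) > width * sigma


# Hypothetical history that oscillates around a slow upward drift.
history = [10.0, 10.4, 10.1, 10.5, 10.2, 10.6, 10.3, 10.7]
print(is_outlier(history, 10.4))  # inside the band
print(is_outlier(history, 15.0))  # far outside the band
```

An MLP or a full ARIMA model could replace `fit_ar1` without changing the compare-against-band step.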
Point outliers : Deviant Approach
• Find points in a given time series whose removal from the time
series results in a more succinct representation of the data
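A rough sketch of the idea, assuming the crudest possible representation: the whole series summarized by a single constant (its mean), with the sum of squared errors as the cost of that summary. The deviant is then the point whose removal lowers that cost the most. The representation, the function names and the sample data are all assumptions made here; richer representations such as multi-bucket histograms follow the same pattern.

```python
def sse(values):
    """Sum of squared errors when values are represented by their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def top_deviant(series):
    """Return (index, gain) for the point whose removal most reduces the
    error of the single-constant representation."""
    base = sse(series)
    gains = [base - sse(series[:i] + series[i + 1:])
             for i in range(len(series))]
    best = max(range(len(series)), key=gains.__getitem__)
    return best, gains[best]


# Hypothetical series: the value 9.0 is what makes the summary expensive.
series = [5.0, 5.1, 4.9, 9.0, 5.0, 5.2]
index, gain = top_deviant(series)
print(index)  # -> 3
```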
Subsequence outliers
• Input: A time series t
• Output: Outlier subsequences in t
• Problem: Given t and a subsequence length n, find the outlier subsequence D that
has the largest distance to its nearest non-overlapping match
• In particular, given two subsequences of length n denoted by $A = (a_1, \ldots, a_n)$
and $B = (b_1, \ldots, b_n)$, the Euclidean distance between them
can be computed as follows:
$\mathrm{Dist}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
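The problem statement translates directly into a brute-force search. The sketch below is illustrative only: the function names and sample series are hypothetical, "non-overlapping" is taken to mean start positions at least n apart, and ties are resolved in favour of the earliest subsequence.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_discord(series, n):
    """Return (start, distance) of the length-n subsequence whose nearest
    non-overlapping match is farthest away."""
    starts = range(len(series) - n + 1)
    best_start, best_dist = None, -1.0
    for i in starts:
        nearest = min(
            (euclidean(series[i:i + n], series[j:j + n])
             for j in starts if abs(i - j) >= n),
            default=None,  # no non-overlapping partner exists
        )
        if nearest is not None and nearest > best_dist:
            best_start, best_dist = i, nearest
    return best_start, best_dist


# Hypothetical periodic series with one disturbed value at index 9.
series = [0.0, 1.0] * 4 + [0.0, 6.0] + [0.0, 1.0] * 4
print(top_discord(series, 2))  # -> (8, 5.0)
```

The nested loops compare every pair of subsequences, so the cost grows quadratically with the length of the series.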
SAX
• The standard way of discretizing the time series: Symbolic
Aggregate approXimation (SAX)
• The brute force solution is to consider all possible subsequences and
compute the distance of each such subsequence to every other
non-overlapping subsequence.
• Several optimizations
▫ HOT SAX (Keogh, E., Lin, J., Fu, A., “HOT SAX: Efficiently finding the most
unusual time series subsequence”, Proceedings of
the Fifth IEEE International Conference on Data Mining, ICDM '05)
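For readers unfamiliar with SAX, the discretization is usually described in three steps: z-normalize the series, average it over equal-width segments (PAA), and map each segment mean to a letter using breakpoints that split the standard normal distribution into equally likely regions. The sketch below follows that description; the function name, the four-letter alphabet and the sample series are assumptions, and it requires a series whose length is a multiple of the segment count.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, segments, alphabet="abcd"):
    """Discretize a numeric series into a SAX word.
    Assumes len(series) is divisible by segments and the series is not
    constant (otherwise the standard deviation is zero)."""
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]              # z-normalization
    size = len(z) // segments
    paa = [mean(z[i * size:(i + 1) * size])             # segment means (PAA)
           for i in range(segments)]
    a = len(alphabet)
    # Breakpoints cut the standard normal into `a` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


# Hypothetical series rising in four steps.
print(sax([1, 1, 4, 4, 6, 6, 9, 9], segments=4))  # -> abcd
```

Once subsequences are reduced to short words, similar subsequences tend to share a word, which is what lets HOT SAX order the search and prune most distance computations.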
Outlier subsequences (Distance based)
• Plotting the discords
[Figure: discord plot. The top discord, which has the largest distance, is the 411th time series point.]
Summary
We have covered anomaly detection for:
• Univariate data
▫ Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians and LOESS
regression)
▫ Data mining methods (Multilayer perceptron)
▫ Outlier subsequences (Window-based and distance-based methods)
• Multivariate data
▫ Cook’s distance
▫ Bonferroni’s test
▫ Local Outlier Factor (LOF)
▫ Hierarchical and K-means clustering outlier detection methods
• Database time series
▫ Database time series definition
▫ Density approach for the first two principal component scores
▫ Bivariate and functional bag plots
▫ Bivariate and functional HDR box plots
▫ Clustering time series
• Censored survival data
▫ Censored survival data definition
▫ Residual-based algorithm
▫ Scoring algorithm
Register here:
https://www.eventbrite.com/e/anomaly-detection-workshop-tickets-25910035614?ref=ebtnebtckt
Affiliate discount pricing for QuantUniversity Meetup members and Academics!
When: August 1st and 2nd
Where: 1 Roger St, Cambridge MA
(IBM’s offices)
Time : 9-5.00pm
Q&A
Slides, code and details about the Anomaly detection workshop
at: http://www.analyticscertificate.com/Anomaly/
Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
39

More Related Content

What's hot

Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender SystemsHybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Matthias Braunhofer
 

What's hot (19)

Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1
 
Ds for finance day 2
Ds for finance day 2Ds for finance day 2
Ds for finance day 2
 
Parsimonious and Adaptive Contextual Information Acquisition in Recommender S...
Parsimonious and Adaptive Contextual Information Acquisition in Recommender S...Parsimonious and Adaptive Contextual Information Acquisition in Recommender S...
Parsimonious and Adaptive Contextual Information Acquisition in Recommender S...
 
An Introduction to Anomaly Detection
An Introduction to Anomaly DetectionAn Introduction to Anomaly Detection
An Introduction to Anomaly Detection
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Feature Selection for Document Ranking
Feature Selection for Document RankingFeature Selection for Document Ranking
Feature Selection for Document Ranking
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly Detection
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
 
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender SystemsHybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
Hybrid Solution of the Cold-Start Problem in Context-Aware Recommender Systems
 
Hybridisation Techniques for Cold-Starting Context-Aware Recommender Systems
Hybridisation Techniques for Cold-Starting Context-Aware Recommender SystemsHybridisation Techniques for Cold-Starting Context-Aware Recommender Systems
Hybridisation Techniques for Cold-Starting Context-Aware Recommender Systems
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Pattern recognition UNIT 5
Pattern recognition UNIT 5Pattern recognition UNIT 5
Pattern recognition UNIT 5
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
 
Musings of kaggler
Musings of kagglerMusings of kaggler
Musings of kaggler
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
 
Pca analysis
Pca analysisPca analysis
Pca analysis
 
Keynote by Agus Sudjianto, Wells Fargo - Interpretable Machine Learning - H2O...
Keynote by Agus Sudjianto, Wells Fargo - Interpretable Machine Learning - H2O...Keynote by Agus Sudjianto, Wells Fargo - Interpretable Machine Learning - H2O...
Keynote by Agus Sudjianto, Wells Fargo - Interpretable Machine Learning - H2O...
 
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
Revisiting Offline Evaluation for Implicit-Feedback Recommender Systems (Doct...
 

Similar to Outlier analysis for Temporal Datasets

Combining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingCombining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linking
Besnik Fetahu
 

Similar to Outlier analysis for Temporal Datasets (20)

Discovering signal in financial time series- where and how to start
Discovering signal in financial time series- where and how to startDiscovering signal in financial time series- where and how to start
Discovering signal in financial time series- where and how to start
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
 
Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016 Performance OR Capacity #CMGimPACt2016
Performance OR Capacity #CMGimPACt2016
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
DataONE Education Module 09: Analysis and Workflows
DataONE Education Module 09: Analysis and WorkflowsDataONE Education Module 09: Analysis and Workflows
DataONE Education Module 09: Analysis and Workflows
 
Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
 
Easy deployable-parking-percom-v6.1
Easy deployable-parking-percom-v6.1Easy deployable-parking-percom-v6.1
Easy deployable-parking-percom-v6.1
 
Emergency response behaviour data collection issue
Emergency response behaviour data collection issueEmergency response behaviour data collection issue
Emergency response behaviour data collection issue
 
Anomaly detection Workshop slides
Anomaly detection Workshop slidesAnomaly detection Workshop slides
Anomaly detection Workshop slides
 
Machine Learning and AI: Core Methods and Applications
Machine Learning and AI: Core Methods and ApplicationsMachine Learning and AI: Core Methods and Applications
Machine Learning and AI: Core Methods and Applications
 
When Should I Use Simulation?
When Should I Use Simulation?When Should I Use Simulation?
When Should I Use Simulation?
 
Machine Learning Applications in Credit Risk
Machine Learning Applications in Credit RiskMachine Learning Applications in Credit Risk
Machine Learning Applications in Credit Risk
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend Analysis
 
Internship Presentation.pdf
Internship Presentation.pdfInternship Presentation.pdf
Internship Presentation.pdf
 
forecasting model
forecasting modelforecasting model
forecasting model
 
Combining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingCombining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linking
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 

More from QuantUniversity

EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !
QuantUniversity
 

More from QuantUniversity (20)

EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfManaging-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
 
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALSPYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
 
Qu for India - QuantUniversity FundRaiser
Qu for India  - QuantUniversity FundRaiserQu for India  - QuantUniversity FundRaiser
Qu for India - QuantUniversity FundRaiser
 
Ml master class for CFA Dallas
Ml master class for CFA DallasMl master class for CFA Dallas
Ml master class for CFA Dallas
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0Algorithmic auditing 1.0
Algorithmic auditing 1.0
 
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
 
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
 
Seeing what a gan cannot generate: paper review
Seeing what a gan cannot generate: paper reviewSeeing what a gan cannot generate: paper review
Seeing what a gan cannot generate: paper review
 
AI Explainability and Model Risk Management
AI Explainability and Model Risk ManagementAI Explainability and Model Risk Management
AI Explainability and Model Risk Management
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0Algorithmic auditing 1.0
Algorithmic auditing 1.0
 
Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021
 
Bayesian Portfolio Allocation
Bayesian Portfolio AllocationBayesian Portfolio Allocation
Bayesian Portfolio Allocation
 
The API Jungle
The API JungleThe API Jungle
The API Jungle
 
Explainable AI Workshop
Explainable AI WorkshopExplainable AI Workshop
Explainable AI Workshop
 
Constructing Private Asset Benchmarks
Constructing Private Asset BenchmarksConstructing Private Asset Benchmarks
Constructing Private Asset Benchmarks
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretability
 
Responsible AI in Action
Responsible AI in ActionResponsible AI in Action
Responsible AI in Action
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in FinanceQu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in Finance
 
Qwafafew meeting 5
Qwafafew meeting 5Qwafafew meeting 5
Qwafafew meeting 5
 

Recently uploaded

Recently uploaded (20)

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 

Outlier analysis for Temporal Datasets

  • 1. Location: QuantUniversity Meetup July 11th 2016 Boston MA Outlier Analysis for Temporal Datasets 2016 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP www.QuantUniversity.com sri@quantuniversity.com
  • 2. 2 Slides and Code available at: http://www.analyticscertificate.com/Anomaly/
  • 3. 3 • 6.30-7.15 – Anomaly Detection part II • 7.15-8.00 - Azure ML Example Agenda
  • 4. - Analytics Advisory services - Custom training programs - Architecture assessments, advice and audits
  • 5. • Founder of QuantUniversity LLC. and www.analyticscertificate.com • Advisory and Consultancy for Financial Analytics • Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers (Shell, Firstfuel Software etc.) • Regular Columnist for the Wilmott Magazine • Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley • Charted Financial Analyst and Certified Analytics Professional • Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston Sri Krishnamurthy Founder and CEO 5
  • 6. 6 Quantitative Analytics and Big Data Analytics Onboarding • Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R • Launching the Analytics Certificate Program later in Fall
  • 7. (MATLAB version also available)
  • 8. 8 • July ▫ 11th : QuantUniversity’s 2nd meetup  Topic : Quantitative methods topic : TBD • August ▫ 1st and 2nd : 2-day workshop on Anomaly Detection  Registration and pricing details at www.analyticscertificate.com/Anomaly ▫ 8th : QuantUniversity meetup ▫ 14-20th : ARPM in New York www.arpm.co  QuantUniversity presenting on Model Risk on August 14th ▫ 18-21st : Big-data Bootcamp http://globalbigdataconference.com/68/boston/big- data-bootcamp/event.html ▫ Use promotional code SPEAKERREF to receive $200 discount on or before July 22nd Events of Interest
  • 9. 9 • July ▫ Anomaly Detection Part II • August ▫ Anomaly Detection Workshop ▫ Model Evaluation : Metrics, Scaling and Best Practices • September ▫ What’s missing ? Best practices in missing data analysis QuantUniversity’s Summer workshop series
  • 10. 10
  • 11. What is anomaly detection? • Anomalies or outliers are data points that appear to deviate markedly from expected outputs. • It is the process of finding patterns in data that don’t conform to a prior expected behavior. 11
  • 12. 12 • Fraud Detection • Stock market • E-commerce Examples
  • 13. Part 1: Summary 13 We have covered Anomaly detection Introduction  Definition of anomaly detection and its importance in energy systems  Different types of anomaly detection methods: Statistical, graphical and machine learning methods Graphical approach  Graphical methods consist of boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate outliers graphically  The main assumption for applying graphical approaches is multivariate normality  Mahalanobis distance methods is mainly used for calculating the distance of a point from a center of multivariate distribution Statistical approach  Statistical hypothesis testing includes of: Chi-square, Grubb’s test  Statistical methods may use either scores or p-value as threshold to detect outliers Machine learning approach  Both supervised and unsupervised learning methods can be used for outlier detection  Piece wised or segmented regression can be used to identify outliers based on the residuals for each segment  In K-means clustering method outliers are defined as points which have doesn’t belong to any cluster, are far away from the centroids of the cluster or shaping sparse clusters
  • 14. Anomaly Detection Part II : Dealing with Temporal Data • In time series datasets, the assumption of temporal continuity plays an important role in defining and detecting outliers. • When analyzing single time series, the lack of temporal continuity with immediate neighbors signal outliers. For example: ▫ A significant increase/decrease in value when compared with immediate neighboring values . Example: Stock charts • When analyzing multidimensional time series streams, temporal continuity is much weaker. For example: ▫ Novel outliers that differ from aggregate trends. Example : Novel client traffic from a new location in Google analytics
  • 15. Point anomalies • Points that or outside of “normal” points
  • 16. Contextual anomalies • Time is a contextual attribute that determines the position of an instance on the entire sequence. • 145 point drop is not rare but it is an anomaly if the drop happens in a period of 3 minutes Ref: http://www.bloomberg.com/news/articles/2013-04-23/fake-report-erasing- 136-billion-shows-market-s-fragility?cmpid=yhoo
  • 17. Nuances in Time series analysis • Time Series Analysis ▫ Numbers across time ▫ Example: Stock data • Discrete sequences ▫ Labels across time ▫ Example: Log of client interactions ▫ http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http- web, ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp, smtp-mail,http-web
  • 18. Collective anomalies • Here, a collection of related data instances is anomalous with respect to the entire data set Ref: http://krebsonsecurity.com/2010/10/pill-gang-used-microsofts-network-to-attack-krebsonsecurity-com/
  • 19. Challenges • Defining what is normal and what isn’t
  • 20. Challenges • The notion of normal behavior keeps evolving
  • 21. Challenges • The magnitude of the anomaly may be different
  • 22. Challenges • Labels may not be available
  • 23. Challenges • Noise may manifest as anomalies and it may be difficult to identify and remove.
  • 24. Methods for Anomaly Detection. Univariate data: • Point outlier scenario: • Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians, LOESS regression) • Data mining methods (Multilayer perceptron) • Outlier subsequences scenario: • Windows based method • Distance based method (PAA, SAX and HOTSAX). Multivariate data: • Statistical methods: • Cook’s distance • Bonferroni’s test • Distance based methods: • Local Outlier Factors (LOF) • Data mining methods: • Clustering algorithms (Hierarchical and K-Means)
  • 25. Methods for Anomaly Detection Database time series univariate and multivariate data • Density approach for principal components • Graphical methods: • Bivariate and functional bag plots • Bivariate and functional HDR box plots • Clustering methods • Euclidean, correlation, autocorrelation and Wavelet transform metrics Censored survival data • Statistical methods: • Residual based algorithm • Scoring algorithm
  • 26. Single Time Series – Sample approaches • Point Outliers ▫ Prediction models ▫ Profile Similarity-based approaches and Deviants • Subsequence Outliers ▫ Discord discovery
  • 27. Point outliers • Input: A time series t • Output: Outlier points in t. Prediction Models: Compute outlier scores as the deviation from a predicted value • Median: ▫ Choose a window size k ▫ Compute the median over the window from t-k to t+k • Mean: ▫ Choose a window size k ▫ Compute the mean over the window from t-k to t+k
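The median variant of the prediction model above can be sketched as below. This is a minimal sketch, not the presenter's code: the name `median_scores` is hypothetical, and excluding the point itself from its own window is an assumption the slide does not state.

```python
from statistics import median


def median_scores(series, k):
    """Score each point by its absolute deviation from the median of its
    neighbours in the window [t-k, t+k] (the point itself is left out)."""
    scores = []
    for t in range(len(series)):
        lo, hi = max(0, t - k), min(len(series), t + k + 1)
        neighbours = series[lo:t] + series[t + 1:hi]
        scores.append(abs(series[t] - median(neighbours)))
    return scores


# Made-up series with one obvious spike at index 4.
series = [10, 11, 10, 12, 50, 11, 10, 12, 11]
scores = median_scores(series, k=2)
print(scores.index(max(scores)))  # 4
```

Swapping `median` for `statistics.mean` gives the mean variant; the median is less affected when the window itself contains another outlier.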
  • 28. Point outliers : Prediction Models • ARIMA framework
  • 29. Point outliers : Prediction Models • Neural Networks ▫ MLP predictor (Figure legend: original data, fitted data, boundaries.) Any data points that are beyond the boundaries are considered outliers
  • 30. Point outliers : Profile Similarity-Based Approach • Create a normal profile (Example: MLP, AR, etc.) and a notion of variance • Estimate the next point • Compare the realized value with the estimated point. ▫ If within the band, normal ▫ Else, outlier
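The estimate-and-compare loop above can be sketched as follows. Note the swap: the slide names MLP or AR models as the normal profile, while this sketch uses a plain trailing mean as a stand-in profile and the trailing standard deviation as the notion of variance. The name `band_outliers` and the default band width of 3 standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev


def band_outliers(series, window, width=3.0):
    """Flag indices whose realized value falls outside
    estimate +/- width * stdev, both computed from the trailing window."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        estimate = mean(history)          # stand-in for an MLP/AR forecast
        band = width * stdev(history)     # notion of variance
        if abs(series[t] - estimate) > band:
            flagged.append(t)
    return flagged


# Made-up readings with a spike at index 6.
readings = [20.0, 21.0, 19.5, 20.5, 20.0, 21.0, 35.0, 20.5, 19.5, 20.0]
print(band_outliers(readings, window=5))  # [6]
```

In this sketch a flagged point still enters the history for later estimates and widens the band; a more careful profile would probably exclude or down-weight it.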
  • 31. Point outliers : Deviant Approach • Find points in a given time series whose removal from the time series results in a more succinct representation of the data
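One heavily simplified way to read the deviant idea is sketched below. The assumption here is that "succinct representation" means a single-value summary (the series mean) and that quality is measured by the sum of squared errors; the point whose removal improves that summary the most is reported. The names `sse` and `top_deviant` are hypothetical.

```python
def sse(values):
    """Sum of squared errors when `values` is summarized by its mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def top_deviant(series):
    """Return (index, gain) for the point whose removal reduces the
    representation error the most."""
    total = sse(series)
    gains = [total - sse(series[:i] + series[i + 1:])
             for i in range(len(series))]
    best = max(range(len(series)), key=lambda i: gains[i])
    return best, gains[best]


# Made-up series; the value 30 at index 4 dominates the error.
index, gain = top_deviant([5, 6, 5, 7, 30, 6, 5])
print(index)  # 4
```

A representation with several buckets or segments would replace the single mean here; the leave-one-out comparison stays the same.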
  • 32. Subsequence outliers • Input: A time series t • Output: Outlier subsequences in t • Problem: Given t and a subsequence length n, find the outlier subsequence (discord) D that has the largest distance to its nearest non-overlapping match • In particular, given two subsequences of length n denoted by $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$, the Euclidean distance between them can be computed as follows: • $\mathrm{Dist}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
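The discord definition above can be sketched with a brute-force search. This is an unoptimized illustration under stated assumptions: the names `euclidean` and `top_discord` and the toy series are made up, and "non-overlapping" is taken to mean start positions at least n apart.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_discord(series, n):
    """Return (start, distance) of the length-n subsequence whose nearest
    non-overlapping match is farthest away."""
    best_start, best_dist = None, -1.0
    starts = range(len(series) - n + 1)
    for i in starts:
        nearest = float("inf")
        for j in starts:
            if abs(i - j) >= n:  # skip overlapping (trivial) matches
                d = euclidean(series[i:i + n], series[j:j + n])
                nearest = min(nearest, d)
        if nearest != float("inf") and nearest > best_dist:
            best_start, best_dist = i, nearest
    return best_start, best_dist


# Periodic toy series with one corrupted value at index 13.
series = [0, 1, 0, -1] * 6
series[13] = 5
print(top_discord(series, n=4))  # (10, 4.0)
```

Every window covering index 13 has a nearest-match distance of 4.0, and the first of them (start 10) is reported. The double loop compares every pair of subsequences, which is the quadratic cost that motivates optimized searches.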
  • 33. SAX • The standard way of discretizing the time series: Symbolic Aggregate approXimation (SAX) • The brute force solution is to consider all possible subsequences and compute the distance from each such subsequence to every other non-overlapping subsequence. • Several optimizations ▫ HOT-SAX (Keogh, E., Lin, J., Fu, A., HOT SAX: Efficiently finding the most unusual time series subsequence. Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM '05)
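As a rough illustration of the discretization step, the sketch below z-normalizes a series, averages it into equal-length segments (PAA), and maps each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name `sax`, the four-letter alphabet, and the requirement that the series length be divisible by the number of segments are assumptions of this sketch; a constant series (zero standard deviation) is not handled.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length `segments`."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]           # z-normalize
    size = len(z) // segments                        # assumes exact division
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(segments)]
    a = len(alphabet)
    # Breakpoints cutting N(0, 1) into `a` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


print(sax([0, 0, 3, 3, 6, 6, 9, 9], segments=4))  # abcd
```

Once subsequences are reduced to short words like this, identical or similar words can be grouped cheaply, which is the property the optimized discord searches rely on.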
  • 34. Outlier subsequences (Distance based) • Plotting the discords. The top discord, which has the largest distance, is at the 411th time series point.
  • 35. Summary. We have covered anomaly detection. Univariate data: ▫ Statistical methods (ARIMA, Seasonal Hybrid ESD test method, EMD and LOESS regression) ▫ Data mining methods (Multilayer perceptron) ▫ Outlier subsequences (Windows and distance based methods). Multivariate data: ▫ Cook’s distance ▫ Bonferroni’s test ▫ Local outlier factor (LOF) ▫ Hierarchical and K-means clustering outlier detection methods. Database time series: ▫ Database time series definition ▫ Density approach for the first two principal component scores ▫ Bivariate and functional bag plots ▫ Bivariate and functional HDR box plot ▫ Clustering time series. Censored survival data: ▫ Censored survival data definition ▫ Residual based algorithm ▫ Scoring algorithm
  • 37. Register here: https://www.eventbrite.com/e/anomaly-detection-workshop-tickets-25910035614?ref=ebtnebtckt Affiliate discount pricing for QuantUniversity Meetup members and academics! When: August 1st and 2nd. Where: 1 Roger St, Cambridge MA (IBM’s offices). Time: 9-5.00pm
  • 38. Q&A • Slides, code and details about the Anomaly detection workshop at: http://www.analyticscertificate.com/Anomaly/
  • 39. Thank you! Members & Sponsors! Contact: Sri Krishnamurthy, CFA, CAP, Founder and CEO, QuantUniversity LLC. srikrishnamurthy www.QuantUniversity.com Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.