Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 1
Symbolic Representations of Time Series
- Nikita
Time Series
• A time series is a sequence of pairs
- Each pair consists of a Time Index and a Value
- The Time Index may be implied if there is a constant difference between consecutive time indices (that is, the series is sampled at a regular interval)
• The time series can be segmented into “Windows”, each of which represents the time series between 2 Time Indices
• Symbols can represent Windows. Because symbols in a Finite Symbol Space have a probability, we can think of the probability of a time series. Symbols are easy to store and manipulate: each symbol can be represented as an integer
Data Mining Constraints
For example, suppose you have one gig of main memory and want to do K-means clustering…
• Clustering ¼ gig of data: 100 sec
• Clustering ½ gig of data: 200 sec
• Clustering 1 gig of data: 400 sec
• Clustering 1.1 gigs of data: a few hours (the data no longer fits in main memory)
Generic Data Mining
• Create an approximation of the data that fits in main memory yet retains the essential features of interest
• Approximately solve the problem at hand in main memory
• Make (hopefully very few) accesses to the original data on disk to confirm the solution
Some Common Approximations
The Symbolic Representation Of Time Series
• A number of algorithms exist to represent time series as symbols in a Finite Symbol Space
- These algorithms are often thought of as “Feature Reducers”
• Self Organizing Maps are a traditional form of Feature Reducer
• SAX (Symbolic Aggregate approXimation) is another, designed specifically for time series
• There are many other ways to reduce a time series to a symbol
- As long as the symbol is drawn from a Finite Symbol Space, the technique described here will work
What is SAX?
• SAX is a methodology for reducing a time series window to a symbol
• The technique was developed by Dr. Eamonn Keogh et al. at the University of California, Riverside in the early 2000s
• It has since drawn a great deal of attention in the world of time series analysis
• It allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n)
• SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure
What is lower bounding?
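• Lower bounding means that for all time series Q and S, we have DLB(Q’, S’) <= D(Q, S), where Q’ and S’ are the reduced representations of Q and S, D is the true distance on the original series, and DLB is the distance measured on the reduced representations.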
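To see why a lower bound is useful, here is a minimal Python sketch of nearest-neighbor search that uses a lower bound to skip exact distance computations. The function names (`euclidean`, `prefix_lower_bound`, `nearest_neighbor`) and the toy data are assumptions made for illustration, and the prefix distance is only a stand-in for any DLB that never exceeds D; it is not the SAX distance itself.

```python
import math

def euclidean(q, s):
    """True distance D(Q, S) on the full series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))

def prefix_lower_bound(q, s, k=2):
    """Hypothetical stand-in for DLB: Euclidean distance on the first k points.
    Dropping non-negative squared terms can only shrink the sum, so this
    value never exceeds euclidean(q, s)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q[:k], s[:k])))

def nearest_neighbor(query, candidates):
    """Return (best_index, best_distance, number_of_exact_computations)."""
    best_idx, best_dist, exact = -1, math.inf, 0
    for idx, cand in enumerate(candidates):
        # If even the lower bound is no better than the best match so far,
        # the true distance cannot be better either, so skip the candidate.
        if prefix_lower_bound(query, cand) >= best_dist:
            continue
        exact += 1
        d = euclidean(query, cand)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, exact

query = [0.0, 0.0, 0.0, 0.0]
candidates = [
    [0.1, 0.0, 0.1, 0.0],  # close to the query
    [5.0, 5.0, 0.0, 0.0],  # far away: pruned by the lower bound
    [0.0, 0.0, 3.0, 3.0],  # lower bound is 0, so it must be checked exactly
]
idx, dist, exact = nearest_neighbor(query, candidates)
print(idx, dist, exact)
```

Because the bound never overestimates, pruning with it cannot discard the true nearest neighbor; a tighter bound simply prunes more candidates.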
What’s a SAX Word?
• A SAX word is the symbol generated by the SAX algorithm
• It is defined by a SAX Alphabet and a length
- The SAX Alphabet is traditionally represented by letters, and its components are referred to as “SAX Letters”
- The size of the alphabet is typically small, which is particularly important for anomaly detection
• When we write out a description of a SAX word, we typically use a string-like representation, such as “abcdefg”
- SAX letters don’t have to be letters: implementations often use numbers based at zero, although we often display them as letters
Symbolic Aggregate ApproXimation
• Lower bounding of Euclidean distance
• Dimensionality Reduction
• Numerosity Reduction
Example SAX word: baabccbc
Normalization of Time Series
• Normalization to zero mean and unit energy (z-normalization).
• The procedure transforms the input vector into an output vector whose mean is approximately 0 and whose standard deviation is close to 1. Each value x_i is replaced by
$$x_i' = \frac{x_i - \mu}{\sigma}$$
where μ is the mean and σ is the standard deviation of the input vector.
• Z-normalization is an essential preprocessing step that allows an algorithm to focus on structural similarities and dissimilarities rather than on amplitude. In order to make meaningful comparisons between two time series, both must be normalized.
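As a rough illustration of the normalization step, the following Python sketch subtracts the mean and divides by the standard deviation. The function name `z_normalize`, the `eps` guard for constant series, and the use of the population standard deviation (dividing by n) are assumptions of this sketch; some implementations divide by n - 1 instead.

```python
import math

def z_normalize(series, eps=1e-12):
    """Return a copy of `series` with zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std < eps:
        # A constant series has no shape to preserve; map it to all zeros
        # rather than dividing by (almost) zero.
        return [0.0] * n
    return [(x - mean) / std for x in series]

# Two made-up series with the same shape but different amplitude.
a = [10.0, 12.0, 14.0, 12.0, 10.0]
b = [100.0, 120.0, 140.0, 120.0, 100.0]
za, zb = z_normalize(a), z_normalize(b)
print(za)
print(zb)
```

After normalization the two series coincide, which is the sense in which the comparison focuses on structure rather than amplitude.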
How to obtain SAX?
• The data is divided into w equal-sized frames
• The mean value of the data falling within each frame is calculated
• The vector of these values becomes the PAA (Piecewise Aggregate Approximation)
[Figure: the example time series (time axis 0 to 120) with its PAA frames labeled by SAX letters]
How to obtain SAX?
Step 1: Reduce dimension by PAA
A time series C of length n can be represented in a w-dimensional space by a vector $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$.
The i-th element is calculated by
$$\bar{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j$$
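The PAA step can be sketched in a few lines of Python. The function name `paa` and the sample values are assumptions for illustration, and the sketch only handles the case where n is divisible by w, so that every frame holds exactly n/w points.

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal frames."""
    n = len(series)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes n is a positive multiple of w")
    frame = n // w  # number of points per frame, n / w
    # Frame i (0-based here) covers series[i*frame : (i+1)*frame]; dividing the
    # frame sum by n/w is the same as multiplying it by w/n.
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]

series = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 0.0, 2.0]  # n = 8
print(paa(series, 4))  # [2.0, 3.0, 11.0, 1.0]
```

The computation touches each input point once, so it runs in time linear in n.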
How to obtain SAX?
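Step 2: Discretization
• The z-normalized series is assumed to have a Gaussian distribution
• Determine breakpoints that split the area under the Gaussian curve into a equal-sized regions, where a is the alphabet size
• Map each PAA coefficient in $\bar{C}$ to the symbol of the region it falls in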
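The discretization step might look like the following Python sketch, which uses the standard library's `statistics.NormalDist` to find the breakpoints. The function names `gaussian_breakpoints` and `to_sax_word` are hypothetical, and the PAA values are made up so that they happen to reproduce the example word on the slides.

```python
from bisect import bisect_right
from statistics import NormalDist

def gaussian_breakpoints(alphabet_size):
    """Breakpoints that cut the standard normal curve into equal-area regions."""
    if not 2 <= alphabet_size <= 26:
        raise ValueError("this sketch supports alphabet sizes from 2 to 26")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def to_sax_word(paa_values, alphabet_size):
    """Map each PAA coefficient to a letter according to its region."""
    breakpoints = gaussian_breakpoints(alphabet_size)
    letters = "abcdefghijklmnopqrstuvwxyz"
    # bisect_right counts the breakpoints <= value: values below the smallest
    # breakpoint get 'a', values from the smallest up to (but not including)
    # the second get 'b', and so on.
    return "".join(letters[bisect_right(breakpoints, v)] for v in paa_values)

# Made-up PAA coefficients of an already z-normalized series.
paa_values = [-0.2, -1.0, -0.9, 0.1, 0.8, 1.2, 0.0, 0.6]
print(gaussian_breakpoints(3))
print(to_sax_word(paa_values, 3))  # baabccbc
```

With an alphabet of size 3 the two breakpoints sit at roughly -0.43 and 0.43, so each letter is equally likely under the Gaussian assumption.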
[Figure: the example series discretized into the SAX word baabccbc (word length 8, alphabet size 3)]
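Gaussian distribution
• Many “natural” quantities are approximately Gaussian (normally) distributed
• A Gaussian process uses lazy learning and a measure of the similarity between points (the kernel function) to predict the value for an unseen point from training data. This is a related but separate concept: SAX itself relies only on the Gaussian distribution of the normalized values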
Ref: https://www.isixsigma.com/tools-templates/normality/tips-recognizing-and-transforming-non-normal-data/
Distance Measure
• Given 2 time series Q and C of length n
– Euclidean distance: $D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
– Distance after transforming the subsequences to PAA: $DR(\bar{Q}, \bar{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}$
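The two distances can be compared numerically with a short Python sketch. The helper names (`euclidean`, `paa`, `paa_distance`) and the two sample series are assumptions for illustration, and the PAA helper assumes the series length is divisible by w.

```python
import math

def euclidean(q, c):
    """Euclidean distance between two series of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def paa(series, w):
    """Frame means; assumes len(series) is divisible by w."""
    frame = len(series) // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]

def paa_distance(q, c, w):
    """Distance between the PAA representations, scaled by sqrt(n / w)."""
    n = len(q)
    q_bar, c_bar = paa(q, w), paa(c, w)
    return math.sqrt(n / w) * math.sqrt(
        sum((a - b) ** 2 for a, b in zip(q_bar, c_bar))
    )

# Made-up series of length n = 8, reduced to w = 4 frames.
q = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
c = [1.0, 1.0, 0.0, 2.0, 3.0, 3.0, -1.0, 0.0]
d_true = euclidean(q, c)
d_paa = paa_distance(q, c, 4)
print(d_true, d_paa)
```

On this data the PAA distance comes out smaller than the Euclidean distance, as the lower-bounding property requires.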
Distance Measure
• Define MINDIST after transforming to the symbolic representation
• MINDIST lower bounds the true distance between the original time series
$\hat{C}$ = baabccbc
$\hat{Q}$ = babcacca
$$MINDIST(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( dist(\hat{q}_i, \hat{c}_i) \right)^2}$$
dist() can be implemented using a table lookup.
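A possible table-lookup implementation of MINDIST is sketched below in Python. The slides do not show how the table is filled, so the rule used here is an assumption taken from the usual SAX formulation: letters that are equal or adjacent are at distance 0, and otherwise the distance is the gap between the breakpoints that separate them. The function names and the series length n = 128 are also assumptions.

```python
import math
from statistics import NormalDist

def gaussian_breakpoints(alphabet_size):
    """Breakpoints that cut the standard normal curve into equal-area regions."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def build_dist_table(alphabet_size):
    """table[r][c] is dist() between letter r and letter c (0 = 'a')."""
    bp = gaussian_breakpoints(alphabet_size)
    table = [[0.0] * alphabet_size for _ in range(alphabet_size)]
    for r in range(alphabet_size):
        for c in range(alphabet_size):
            if abs(r - c) > 1:
                # Gap between the breakpoint just below the higher letter
                # and the breakpoint just above the lower letter.
                table[r][c] = bp[max(r, c) - 1] - bp[min(r, c)]
    return table

def mindist(q_word, c_word, n, alphabet_size):
    """MINDIST between two SAX words of equal length w, for series length n."""
    w = len(q_word)
    table = build_dist_table(alphabet_size)
    total = 0.0
    for q_letter, c_letter in zip(q_word, c_word):
        d = table[ord(q_letter) - ord("a")][ord(c_letter) - ord("a")]
        total += d * d
    return math.sqrt(n / w) * math.sqrt(total)

# The two words from the slide; n = 128 is an assumed original length.
result = mindist("babcacca", "baabccbc", n=128, alphabet_size=3)
print(result)
```

Only the two positions where 'a' meets 'c' contribute to the sum here; every other pair of letters is equal or adjacent and therefore counts as zero.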
Novelty Detection
• Fault detection
• Interestingness detection
• Anomaly detection
• Surprisingness detection

Editor's Notes

  1. A time series is a collection of observations made sequentially in time
  2. Researchers have proposed various methodologies to represent time series more efficiently, including dimensionality reduction and numerosity reduction techniques. Examples include the Discrete Wavelet Transform (DWT) and the Discrete Fourier Transform (DFT), which approximate the series while requiring less storage space. Another line of research on time series representation focuses on converting numeric values into symbolic form. SAX adopts both dimensionality reduction (DR) and numerosity reduction (NR) techniques.
  3. Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set. Dimensionality reduction (e.g., removing unimportant attributes) avoids the curse of dimensionality, helps eliminate irrelevant features and reduce noise, reduces the time and space required in data mining, and allows easier visualization. Numerosity reduction (some simply call it data reduction) reduces data volume by choosing alternative, smaller forms of data representation. Parametric methods (e.g., regression) assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers); for example, log-linear models obtain the value at a point in m-D space as the product of appropriate marginal subspaces. Non-parametric methods do not assume models; the major families are histograms, clustering, and sampling. Data compression is a further form of data reduction.
  4. Without normalization, values on a larger scale are given increased weight; feature scaling lets the other components contribute as well. Feature scaling is a pretty common normalization technique, and what I usually default to unless there is a reason to attempt another technique. In order to make meaningful comparisons between two time series, both must be normalized. Data normalization (centering and scaling) tends to help with model convergence and stability when dealing with machine learning algorithms: feeding ML algorithms input data with wildly different means and variances can slow or prevent model convergence. If you have multiple inputs whose amplitudes differ, it is better to normalize them. In other words, when inputs have different means and variances, normalization gives all of them zero mean and unit variance, so that every input carries the same weight on the output. To normalize, subtract the mean of each input from it and then divide by its standard deviation.
  5. Compute the SAX letter by dividing the Standard Normal Distribution into K regions of equal area under the curve and assigning each component of the PAA a letter from the SAX Alphabet corresponding to the region indexed by the PAA value. Repeating for each value of the PAA yields a SAX word of the same length as the PAA.
  6. First convert the time series to the PAA representation, then convert the PAA to symbols. This takes linear time.
  7. First convert the time series to the PAA representation, then convert the PAA to symbols. This takes linear time. Normalization to zero mean and unit energy: the procedure transforms the input vector into an output vector whose mean is approximately 0 and whose standard deviation is close to 1, by subtracting the mean and dividing by the standard deviation. Z-normalization is an essential preprocessing step which allows an algorithm to focus on the structural similarities and dissimilarities rather than on the amplitude.
  8. Compute the SAX letter by dividing the Standard Normal Distribution into K regions of equal area under the curve and assigning each component of the PAA a letter from the SAX Alphabet corresponding to the region indexed by the PAA value. Repeating for each value of the PAA yields a SAX word of the same length as the PAA. It is assumed that the normalised time series has a Gaussian distribution. Next, the so-called 'breakpoints' are determined that will produce k equal-sized areas under the standard normal curve, shown with coloured dotted lines in the 2nd figure. All PAA coefficients that are below the smallest breakpoint are mapped to the symbol 'a', all coefficients greater than or equal to the smallest breakpoint and less than the second-smallest breakpoint are mapped to the symbol 'b', and so on. Have a look at Fig. 2 to see what is going on.
  9. (1) Normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$. (2) Skewness $= \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$. (3) Kurtosis $= \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4}$, where $\bar{x}$ is the mean, s is the standard deviation, and n is the length of the time series. (4) Remarks: skewness is a measure of the asymmetry of the probability density function, and kurtosis is a measure of its flatness. The normal (Gaussian) distribution exhibits zero skewness and a kurtosis value of 3. The letter assignment is done by dividing the Standard Normal Distribution into K + 1 sections of equal area under the curve, and then assigning the letter corresponding to the section in which the value lies. This results in an array of length N, each component being a value between 0 and K, which can be treated as a symbol. It is assumed that the normalised time series has a Gaussian distribution. Next, the so-called 'breakpoints' are determined that will produce k equal-sized areas under the standard normal curve, shown with coloured dotted lines in the 2nd figure. All PAA coefficients that are below the smallest breakpoint are mapped to the symbol 'a', all coefficients greater than or equal to the smallest breakpoint and less than the second-smallest breakpoint are mapped to the symbol 'b', and so on. Have a look at Fig. 2 to see what is going on.