Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT), while requiring less storage space

Another line of research on time series representation focuses on converting numeric values into symbolic form.

SAX adopts both dimensionality reduction (DR) and numerosity reduction (NR) techniques.

Dimensionality reduction, e.g., remove unimportant attributes

Dimensionality reduction
◦ Avoid the curse of dimensionality
◦ Help eliminate irrelevant features and reduce noise
◦ Reduce time and space required in data mining
◦ Allow easier visualization

Numerosity reduction (some simply call it: Data Reduction)

Reduce data volume by choosing alternative, smaller forms of data representation

Parametric methods (e.g., regression)
◦ Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
◦ Ex.: Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods
◦ Do not assume models
◦ Major families: histograms, clustering, sampling

◦ Data compression

In order to make meaningful comparisons between two time series, both must be normalized.

Data normalization (centering and scaling) tends to help with model convergence and stability when dealing with machine learning algorithms. Feeding ML algorithms input data with wildly different means and variances can slow or prevent model convergence.

If you have multiple inputs whose amplitudes differ, it is better to normalize them. In other words, if your inputs have different means and variances, normalization gives all of them zero mean and unit variance, so that every input carries the same weight on the output. To normalize, subtract the mean of each input from that input and then divide by its standard deviation.

It takes linear time.

Normalization to Zero Mean and Unit of Energy.

The procedure ensures that all elements of the input vector are transformed into an output vector whose mean is approximately 0 and whose standard deviation is close to 1. The formula behind the transform is shown below:

$x'_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the input series.

z-normalization is an essential preprocessing step which allows an algorithm to focus on the structural similarities/dissimilarities rather than on the amplitude.
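As a minimal sketch of z-normalization (subtract the mean, divide by the standard deviation), the snippet below uses only the Python standard library. The function name `z_normalize`, the `eps` guard for near-constant series, and the choice of the population standard deviation (dividing by n) are assumptions made for illustration; they do not come from the slides.

```python
import math


def z_normalize(series, eps=1e-8):
    """Return a copy of `series` with zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    # Assumption: population standard deviation (divide by n).
    # Some implementations divide by n - 1 instead.
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std < eps:
        # Near-constant series: avoid dividing by ~0 and return all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_normalize(raw)
print(normalized)
```

The computation is a single pass for the mean and another for the deviation, so it runs in time linear in the length of the series.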

It is assumed that the normalised time series has a Gaussian distribution. Next, the so-called 'breakpoints' are determined that will produce k equal-sized areas under the standard normal curve, shown with coloured dotted lines in the 2nd figure. All PAA coefficients that are below the smallest breakpoint are mapped to the symbol 'a', all coefficients greater than or equal to the smallest breakpoint and less than the second-smallest breakpoint are mapped to the symbol 'b', and so on. Have a look at Fig. 2 to see what is going on.
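The breakpoint rule described above can be sketched as follows. This is an illustrative sketch, not the presenter's implementation: the function names `gaussian_breakpoints` and `to_symbols` are hypothetical, and the breakpoints are computed with the inverse CDF of the standard normal distribution from Python's `statistics.NormalDist` rather than read from a precomputed table.

```python
from bisect import bisect_right
from statistics import NormalDist


def gaussian_breakpoints(k):
    """Return the k - 1 breakpoints that split the standard normal
    curve into k regions of equal area."""
    if not 2 <= k <= 26:
        raise ValueError("this sketch supports alphabet sizes 2..26")
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf(i / k) for i in range(1, k)]


def to_symbols(paa_coefficients, k):
    """Map each PAA coefficient to a letter 'a', 'b', ... by region."""
    breakpoints = gaussian_breakpoints(k)
    letters = "abcdefghijklmnopqrstuvwxyz"[:k]
    # bisect_right counts the breakpoints that are <= value, so a value
    # below the smallest breakpoint gets index 0 ('a'), and a value equal
    # to a breakpoint falls into the higher region.
    return "".join(letters[bisect_right(breakpoints, v)] for v in paa_coefficients)


word = to_symbols([-1.0, 0.0, 1.0], k=3)
print(word)
```

Computing the breakpoints on demand keeps the sketch short; a fixed lookup table indexed by alphabet size would serve equally well when only a few alphabet sizes are used.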

$\text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$

$\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4}$

where $\bar{x}$ is the mean, $s$ is the standard deviation, and $n$ is the length of the time series.

Skewness is a measure of the asymmetry of the probability density function.

This assignment is done by dividing the Standard Normal Distribution into K + 1 sections of equal area under the curve, and then assigning each value the letter of the section in which it lies. This results in an array of length N, each component being a value between 0 and K, which can be treated as a symbol.

Kurtosis is a measure of the flatness of the probability density function.

The normal (Gaussian) distribution has zero skewness and a kurtosis value of 3.
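A short sketch of the skewness and kurtosis statistics defined in these notes, using only the standard library. The function name `skewness_and_kurtosis` is hypothetical, and the notes do not say whether s is the population or the sample standard deviation; the population form (dividing by n) is assumed here.

```python
import math


def skewness_and_kurtosis(series):
    """Return (skewness, kurtosis) as the third and fourth standardized moments."""
    n = len(series)
    mean = sum(series) / n
    # Assumption: s is the population standard deviation (divide by n).
    s = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if s == 0:
        raise ValueError("skewness and kurtosis are undefined for a constant series")
    skew = sum((x - mean) ** 3 for x in series) / (n * s ** 3)
    kurt = sum((x - mean) ** 4 for x in series) / (n * s ** 4)
    return skew, kurt


skew, kurt = skewness_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(skew, kurt)
```

Comparing the result against the Gaussian reference values (skewness 0, kurtosis 3) gives a rough indication of how well the Gaussian assumption behind the breakpoints holds for a given series.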

- 1. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 1 Symbolic Representations of Time Series - Nikita
- 2. Time Series: A time series is a sequence of pairs. Each pair consists of a Time Index and a Value. The Time Index may be implied if there is a constant difference between values. The time series can be segmented into “Windows”, each representing the time series between 2 Time Indices. Symbols can represent Windows. Because symbols in a Finite Symbol Space have a probability, we can think of the probability of a time series. Symbols are easy to store and manipulate: each symbol can be represented as an integer. [Figure: example time series plot]
- 3. Data Mining Constraints: For example, suppose you have one gig of main memory and want to do K-means clustering… Clustering ¼ gig of data: 100 sec. Clustering ½ gig of data: 200 sec. Clustering 1 gig of data: 400 sec. Clustering 1.1 gigs of data: a few hours.
- 4. Generic Data Mining: Create an approximation of the data that fits in main memory yet retains the essential features of interest. Approximately solve the problem at hand in main memory. Make (hopefully very few) accesses to the original data on disk to confirm the solution.
- 5. Some Common Approximations
- 6. The Symbolic Representation Of Time Series: A number of algorithms exist to represent time series as symbols in a Finite Symbol Space. These algorithms are often thought of as “Feature Reducers”. Self Organizing Maps are a traditional form of Feature Reducer. SAX (Symbolic Aggregate approXimation) is another, designed specifically for time series. There are many other ways to reduce a time series to a symbol. As long as the symbol is drawn from a Finite Symbol Space, the technique described here will work.
- 7. What is SAX? SAX is a methodology for reducing a time series window to a symbol. The technique was developed by Dr. Eamonn Keogh et al. at the University of California at Riverside in the early 2000s. It has since drawn a great deal of attention in the world of time series analysis. It allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n). SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure.
- 8. What is lower bounding? Lower bounding means that for all Q and S, we have $D_{LB}(Q', S') \le D(Q, S)$.
- 9. What’s a SAX Word? A SAX word is the symbol generated by the SAX algorithm. It is defined by a SAX Alphabet and a length. The SAX Alphabet is traditionally represented by letters, and its components are referred to as “SAX Letters”. The size of the alphabet is typically small; this is particularly important for anomaly detection. When we write out a description of a SAX word, we typically use a string-like representation, such as “abcdefg”. SAX letters don’t have to be letters: implementations often use numbers starting at zero, although we often display them as letters.
- 10. Symbolic Aggregate ApproXimation: Lower bounding of Euclidean distance. Dimensionality Reduction. Numerosity Reduction. [Figure: time series with SAX word baabccbc]
- 11. Normalization of Time Series: Normalization to Zero Mean and Unit of Energy. The procedure ensures that all elements of the input vector are transformed into an output vector whose mean is approximately 0 and whose standard deviation is close to 1. The formula behind the transform is shown below: $x'_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the input series. Z-normalization is an essential preprocessing step which allows an algorithm to focus on the structural similarities/dissimilarities rather than on the amplitude. In order to make meaningful comparisons between two time series, both must be normalized.
- 13. How to obtain SAX? Data is divided into w equal-sized frames. The mean value of the data falling within each frame is calculated. The vector of these values becomes the PAA. [Figure: time series over 0 to 120 with frames labeled b, b, b, a, c, c, c, a]
- 14. How to obtain SAX? Step 1: Reduce dimension by PAA. A time series C of length n can be represented in a w-dimensional space by a vector $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$. The ith element is calculated by $\bar{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j$
- 15. How to obtain SAX? Step 2: Discretization. Normalize $\bar{C}$ to have a Gaussian distribution. Determine breakpoints that will produce $a$ equal-sized areas under the Gaussian curve, where $a$ is the alphabet size. [Figure: time series over 0 to 120 with frames labeled b, b, b, a, c, c, c, a] Resulting SAX word: baabccbc (Words: 8, Alphabet: 3)
- 17. Gaussian distribution: Most "natural" distributions. A Gaussian process uses lazy learning and a measure of the similarity between points (this is the kernel function) to predict the value for an unseen point from training data. Ref: https://www.isixsigma.com/tools-templates/normality/tips-recognizing-and-transforming-non-normal-data/
- 18. Distance Measure: Given 2 time series Q and C: Euclidean distance; distance after transforming the subsequence to PAA.
- 19. Distance Measure: Given 2 time series Q and C: Euclidean distance; distance after transforming the subsequence to PAA.
- 20. Distance Measure: Define MINDIST after transforming to symbolic representation. MINDIST lower bounds the true distance between the original time series. Example: $\hat{C}$ = baabccbc, $\hat{Q}$ = babcacca. $MINDIST(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( dist(\hat{q}_i, \hat{c}_i) \right)^2}$. dist() can be implemented using a table lookup.
- 21. Novelty Detection: Fault detection. Interestingness detection. Anomaly detection. Surprisingness detection.
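The PAA step and the MINDIST measure from the slides can be sketched together in a few lines of standard-library Python. This is a hedged illustration rather than the presenter's code: the names `paa`, `symbol_distance` and `mindist` are hypothetical, the sketch assumes w divides n evenly, the breakpoints are computed with `statistics.NormalDist` instead of a stored table, and the original series length n = 128 used in the example is an assumed value that the slides do not state.

```python
import math
from statistics import NormalDist


def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal-sized frames."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes w divides n evenly")
    frame = n // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


def symbol_distance(i, j, breakpoints):
    """dist() between two symbol indices (0 = 'a', 1 = 'b', ...).

    Equal or adjacent symbols are at distance 0; otherwise the distance is
    the gap between the breakpoints that separate the two regions.
    """
    lo, hi = min(i, j), max(i, j)
    if hi - lo <= 1:
        return 0.0
    return breakpoints[hi - 1] - breakpoints[lo]


def mindist(q_word, c_word, n, alphabet_size):
    """MINDIST between two SAX words of equal length w, for original length n."""
    if len(q_word) != len(c_word):
        raise ValueError("SAX words must have the same length")
    w = len(q_word)
    std_normal = NormalDist()
    breakpoints = [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    total = 0.0
    for q, c in zip(q_word, c_word):
        d = symbol_distance(ord(q) - ord("a"), ord(c) - ord("a"), breakpoints)
        total += d * d
    return math.sqrt(n / w) * math.sqrt(total)


frames = paa([1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 4.0, 8.0], w=4)
# The two example words from the slides; n = 128 is an assumed length.
d = mindist("babcacca", "baabccbc", n=128, alphabet_size=3)
print(frames, d)
```

In a real index, `symbol_distance` would typically be precomputed once into an alphabet-sized lookup table, which is what the slide means by implementing dist() with a table lookup.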
