SAX-VSM
Interpretable time series classification
with SAX, TF*IDF, and Vector Space Model
Pavel Senin
senin@hawaii.edu
University of Hawaii at Manoa
Department of Information and Computer Sciences
Collaborative Software Development Laboratory
http://csdl.ics.hawaii.edu
Temporal data
• Probably most of the collected data is temporal
• Smarter technology (monitoring, on-line adjustment):
– smarter house, power grid, water supply
– smarter traffic
– smarter cooking of your favorite food
• Health, personal and global
– blood pressure, heartbeat, sugar level, weight
– epidemiology
• Safety and Sustainability
– fraud detection, unusual activity mining
– civil infrastructure: bridges, buildings, roads
– weather, seismography, astronomy
– smarter agriculture
• Economy: money, stocks, markets, shopping
• Social networks, media, entertainment
http://www.imdb.com/title/tt0192618/
Problem definition
• Given a sequence of points, or a live stream of points
• Find:
– patterns, outliers, (motifs, discords)
• Perform:
– classification, clustering, forecasting
• Gain domain-specific knowledge, infer a generative process
$x^1_1, x^1_2, x^1_3, \ldots, x^1_k$
$\ldots$
$x^m_1, x^m_2, x^m_3, \ldots, x^m_k$
Real-life data:
- not equidistant
- compressed/stretched
- congested
- noise
- lost points
Similarity? Yes, you know when you see it!
But one needs to teach machines to see
that difference too.
It turns out to be quite a difficult task.
All solutions are based on the similarity
in Time, Shape, or Change.
How many metrics are out there?
Pseudoquasimetrics anyone?
http://blog.sfgate.com/pets/2009/03/18/pet-look-alike-photo-contest/, 02-13-2013
’’…Euclidean distance or Dynamic Time Warping (DTW) distance does not significantly
outperform random guessing. The reason for the poor performance of these otherwise very
competitive classifiers seems to be due to the fact that the data is somewhat noisy (i.e. insect
bites, and different stem lengths)…’’
“Time Series Shapelets: A New Primitive for Data Mining”, L. Ye, E. Keogh.
’’…However, it is clear that one-nearest-neighbor with Dynamic Time Warping
(DTW) distance is exceptionally difficult to beat. This approach has one weakness,
however; it is computationally too demanding for many realtime applications…’’
“Fast Time Series Classification Using Numerosity Reduction”, Xi, Keogh, Shelton, Wei
By far, the most ubiquitous distance measure for time series is the Euclidean
distance. The 1-NN Euclidean classifier is fast and accurate. Everyone benchmarks
against it because the 1-NN Euclidean classifier is really hard to beat.
State of the art
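The two baselines quoted above, 1-NN with Euclidean distance and 1-NN with DTW, can be sketched in a few lines. This is a minimal illustration, not the benchmark code behind the quoted papers: the function names and the toy series are assumptions, and the DTW here is the plain quadratic version with no warping window.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping distance, no window constraint."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative squared cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def classify_1nn(query, train, distance):
    """train is a list of (label, series); return the label of the nearest series."""
    return min(train, key=lambda item: distance(query, item[1]))[0]


# Toy data (hypothetical): a flat series and a bump, plus a time-shifted bump to classify.
train = [("flat", [0, 0, 0, 0, 0, 0]), ("bump", [0, 0, 1, 2, 1, 0])]
shifted_bump = [0, 1, 2, 1, 0, 0]

label_euclidean = classify_1nn(shifted_bump, train, euclidean)
label_dtw = classify_1nn(shifted_bump, train, dtw)
```

On this toy input DTW absorbs the shift completely, while the Euclidean distance still penalizes it; the quadratic cost of the DTW table is the computational burden the second quote refers to.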
“…our basic message: transforming the data is the simplest way of achieving
improvement in problems where the discriminating features are based on similarity in
change and similarity in shape...”
“Transformation Based Ensembles for Time Series Classification”, Bagnall, A., Davis, L., Hills, J., Lines, J.
• Can we ignore the time?
• Can we step aside from mean, variance, kurtosis and skewness?
• Can we transform the temporal data into some feature space?
• Can we then actively learn from these features?
• SAX-VSM does it all, it doesn’t ignore time though, ordering is “loosely” kept.
(It keeps ordering within a sliding window, but not globally.)
Features: what are they?
“Experiencing SAX: a Novel Symbolic Representation of Time Series", J.Lin, E.Keogh, L.Wei, S.Lonardi
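A minimal sketch of how SAX words can be extracted with a sliding window, following the general recipe of the SAX paper cited above: z-normalize, reduce with PAA, then map each segment mean to a letter using equiprobable Gaussian breakpoints. All function names are assumptions made for illustration; this is not the JMotif code, it requires the window to divide evenly into segments, and it omits numerosity reduction.

```python
import statistics
from bisect import bisect_right
from collections import Counter
from statistics import NormalDist


def znorm(series, eps=1e-8):
    """Z-normalize; a (nearly) constant series maps to all zeros."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std < eps:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("this sketch needs len(series) divisible by segments")
    step = len(series) // segments
    return [statistics.fmean(series[i:i + step]) for i in range(0, len(series), step)]


def breakpoints(alphabet_size):
    """Cut points splitting the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax_word(series, segments, alphabet_size):
    """Convert one (sub)series into a SAX word such as 'abcd'."""
    cuts = breakpoints(alphabet_size)
    return "".join(
        chr(ord("a") + bisect_right(cuts, value))
        for value in paa(znorm(series), segments)
    )


def sax_bag(series, window, segments, alphabet_size):
    """Slide a window over the series and count the SAX words it produces."""
    bag = Counter()
    for start in range(len(series) - window + 1):
        bag[sax_word(series[start:start + window], segments, alphabet_size)] += 1
    return bag


ramp_word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4)
ramp_bag = sax_bag(list(range(16)), window=8, segments=4, alphabet_size=4)
```

The bag discards where in the series each word occurred, which is the sense in which ordering is kept within a window but not globally.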
Features versus Distance
• Dataset: electrical devices
– Kettle, Immersion heater, fridge/freezer, oven/cooker, computer/television,
and a dishwasher.
– Measured every 15 minutes – means that ~15 minutes of information are lost
“Classification of Household Devices by Electricity Usage Profiles", Lines et al. http://www.uea.ac.uk/cmp
Jmotif take on ED dataset: https://code.google.com/p/jmotif/wiki/ElectricalDevices
Error:
Euclidean 1NN: 46%
DTW 1NN: 33%
Shapelet Tree: 45%
Shapelet SVM: 75%
SAX-VSM: 32%
Distance
Features
Implementation and reproducibility
• Sometimes there is a large difference in precision,
• which aligns with the no-free-lunch theorems
– http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization
• My code is online here: https://github.com/jMotif/sax-vsm_classic, old location with more wiki
and docs: http://code.google.com/p/jmotif/
• Feel free to use it for your needs. Please, contribute your changes. It is GPL.
• Everything is reproducible. Most of the data is available at the UCR homepage; other datasets are online
too.
• Due to the active development things might change a bit.
Where I am coming from:
yet another application problem - behaviors
• I work on software processes for my thesis, specifically on recurrent behaviors
– Given a live stream (or a Git log) of telemetry from hundreds of software project developers (the Linux kernel
as an example), find:
• What process do they perform, and what is their goal? What are their habits?
• Outliers? Clusters? I.e., roles and groups?
– Given a dozen software project trails, are they similar or different in their software process?
– What about the people who generate these artifacts? Here I mean NO periodicity, TONS of lost values,
PLUS congestion, compression, you name it: the data is corrupted.
• How I arrived at this method: I realized that the behaviors of every single individual must
be counted in, since they all knit the software together. So, when I looked at all the trails in
SAX space, the choice of TF*IDF was obvious.
• TF*IDF takes away similarities and highlights the behaviors which “stand out of the crowd”.
Moreover, it weights their importance by counting their re-occurrence. So you will see
someone changing little things here or there.
• The Vector Space Model, in turn, takes care of counting these “selected behaviors” in
unknown temporal containers, pointing to the class they should be assigned to, with a score!
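The two roles described above can be sketched on toy bags of SAX words: TF*IDF drops words shared by every class and weights the rest, and the Vector Space Model scores an unlabeled bag against each class by cosine similarity. The function names, the toy bags, and the specific weighting (1 + log count) * log(N / df) are assumptions chosen for illustration, not necessarily what JMotif does.

```python
import math
from collections import Counter


def tfidf_vectors(class_bags):
    """class_bags: {label: Counter(word -> count)} -> {label: {word: weight}}."""
    n_classes = len(class_bags)
    df = Counter()  # in how many class bags each word occurs
    for bag in class_bags.values():
        df.update(bag.keys())
    return {
        label: {
            word: (1.0 + math.log(count)) * math.log(n_classes / df[word])
            for word, count in bag.items()
            if df[word] < n_classes  # a word seen in every class has idf 0
        }
        for label, bag in class_bags.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(bag, vectors):
    """Score the raw word counts of an unlabeled bag against every class vector."""
    query = {word: float(count) for word, count in bag.items()}
    scores = {label: cosine(query, vec) for label, vec in vectors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical class bags: "bbb" is common to both classes and carries no weight.
class_bags = {
    "A": Counter({"abc": 5, "bbb": 3}),
    "B": Counter({"cba": 4, "bbb": 6}),
}
vectors = tfidf_vectors(class_bags)
label, scores = classify(Counter({"abc": 2, "bbb": 7}), vectors)
```

The returned scores are what makes the result inspectable: the words with the largest weights in the winning class vector are the ones that decided the assignment.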
http://www.darpa.mil/Our_Work/I2O/Programs/Active_Authentication.aspx
02-13-2013
Why behaviors? Yet another reason.
And many other things can be made “smart”.
“…The current standard method for validating a user’s identity for
authentication on an information system requires humans to do
something that is inherently unnatural: create, remember, and manage
long, complex passwords. Moreover, as long as the session remains
active, typical systems incorporate no mechanisms to verify that the
user originally authenticated is the user still in control of the
keyboard. ..”
SAX-VSM classification at large:
features
All this has been well known since 1972; I wasn’t born yet. Thank you, Gerard Salton!
All this has been known since 2002; I wasn’t in grad school yet. Thank you, Jessica and Eamonn!
The only formulas. All other stuff is counting, hashing,
threading, and other Java magic.
http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html#sec:querydocweighting
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
Cosine similarity
TF*IDF
http://nlp.stanford.edu/IR-book/newslides.html
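For reference, one standard form of these two formulas, following the weighting schemes in the Manning, Raghavan and Schütze textbook cited above. Whether JMotif uses exactly this variant is an assumption. Here $f_{t,d}$ is the raw count of term (SAX word) $t$ in document (class bag) $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents containing $t$.

```latex
\[
w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t,
\qquad
\mathrm{tf}_{t,d} =
\begin{cases}
1 + \log f_{t,d} & \text{if } f_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases},
\qquad
\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}
\]
\[
\cos(\vec{a}, \vec{b})
= \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}
= \frac{\sum_t a_t b_t}{\sqrt{\sum_t a_t^2} \, \sqrt{\sum_t b_t^2}}
\]
```

A term that occurs in every document gets $\mathrm{idf}_t = \log 1 = 0$, so it contributes nothing to the cosine score.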
IT IS NOT COMPLETELY NEW
Later, I found that the idea had been exercised quite a
few times before, with some success...
But I would argue that I came to it and built it
all alone, and made it work just fine.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390719/
This is that dataset: Synthetic Control
UCR Synthetic Control
JMotif error 0.7%
on par with 1-NN DTW
Error rate surface:
one parameter is fixed,
two others vary.
Were they somewhere here?
Clustering:
UCR Synthetic Control,
No labels = No problem
k-Means works too
(if you know the K)
Another toy example: just three classes, CBF
CBF, Classification accuracy is 99.9%
http://jmotif.googlecode.com/svn/trunk/RCode/cbf/par_surface.gif
CBF: accuracy, runtime, features.
Classifying 30K series (10K of each class)
Slow TFIDF stat
but slow growing!
Euclidean 1NN
Fast classification
Rate of absorbing (learning)
of new features,
Each class ~1/3 of the whole
as expected…
CBF, SAX-VSM features importance distribution
Another well-studied example: Gun/Point dataset
Slide created by Eamonn Keogh, eamonn@cs.ucr.edu
Slide by Lexiang Ye and Eamonn Keogh, www.cs.ucr.edu/~lexiangy/shapelet.html
Looking for the best
discriminating subseries
Shapelet‐style
Euclidean 91.3%, DTW 90.7%, Shapelet 93.3%, SAX-VSM 99.44%
Looking for the best
discriminating subseries
SAX‐VSM‐style
Gun/Point, SAX-VSM features importance distribution
Leaves classification
http://oregonstate.edu/dept/ldplants/Plant%20ID-Leaves.htm
Leaf attachment :
The pattern by which leaves are attached to a
stem or twig is also a useful characteristic in
plant identification. There are two large
groups, alternate and opposite patterns, and
a third less common pattern, whorled.
Leaf lobes:
Leaves may be lobed or not lobed. A lobe may
be defined as a curved or rounded
projection. With leaves there is no clear
distinction between shallow lobes and deep
teeth. A main vein is often visible in a lobe,
this may not occur in teeth.
Leaf margin:
Another important leaf characteristic for plant
identification is the edge or margin of a leaf or
leaflet. Leaves have either smooth edges,
called entire, or small notches or “teeth”
along the margin.
Leaves classification with SAX-VSM
Euclidean 51.7%, DTW 59.1%, Shapelet ?, SAX-VSM 92.2%
Moreover: SAX-VSM highlights same places as human experts do
Acer Circinatum, Acer Glabrum
Acer Macrophyllum, Acer Negundo
Quercus Kelloggii, Quercus Garryana
Coffee spectrograms classification
          Caffeine   Chlorogenic acid
Arabica   1.2%       5.5-8.0%
Robusta   2.2%       7.0-10.0%
http://code.google.com/p/jmotif/wiki/ArabicaRobusta
http://www.coffeechemistry.com/index.php/General/Agriculture/differences‐between‐arabica‐and‐robusta‐coffee.html
Euclidean 75.0%, DTW 82.1%, Shapelet ?, SAX-VSM 100%
SAX-VSM classification accuracy study
http://code.google.com/p/jmotif/wiki/Precision
• Shapelets can differ in length, and that works better.
• What if words differ in length too?
– I can use different SAX parameters and strategies, aiming to pick up class
specificities.
– But yes, the search space grows too…
It can be better: TF*IDF and VSM
do not care about word length or the number of words!
Yoga dataset + Jmotif + {set of SAX params}*2 = -5% of error!
Figure from "Semi-Supervised Time Series Classification". Li Wei & Eamonn Keogh
Male or Female???
Euclidean 83%, DTW 83.6%, Shapelet ?, SAX-VSM 82% -> 87% (still running though)
Best SAX parameters search
Sampling the whole space with DIRECT
Slice of space with Sliding Window=42
Down from 432 points to 42
= 10X speedup
Results
• SAX-VSM is, statistically speaking, at the level of the current state of
the art.
• It is fast, if (“and”?) you can learn offline.
• Parameter search is the show stopper for now.
• But the best thing: it allows one to find unique temporal
patterns or discriminating features and weight them by
importance among hundreds of thousands of candidates.
Open questions
• Still, efficient parameter search.
• Evaluation methodology. What is the proper way?
• TF*IDF implementations: how to choose a good one?
Where it goes… roadmap
• Ngrams
– http://books.google.com/ngrams/graph?content=Sherlock+Holmes%2CAlbert+Einstein%2CSputnik%2Ctime+series%2CANOVA%2Cfeature+extraction%2Ctfidf%2BTFIDF%2CGoogle%2CINRA%2CNgram&year_start=1900&year_end=2008&corpus=15&smoothing=4&share=
• Sequitur
– Extracting a grammar from the time series and using it as input to the Vector Space part. I hope it
will solve the parameter search problem.
• Prediction !!!
– Grammars can tell what the next word is. Ngrams can help with that too.
• Multidimensional series
– all the dimensions into prefixed bags
– Ngrams spanning dimensions
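Two of the roadmap items are easy to picture in code: n-grams over the sequence of SAX words, and merging the words of a multidimensional series into one bag by prefixing each word with its dimension. This is only a sketch of the idea; the function names, the "_" and ":" separators, and the toy words are all assumptions.

```python
from collections import Counter


def word_ngrams(words, n):
    """Count n-grams of consecutive SAX words, joined with '_'."""
    return Counter("_".join(words[i:i + n]) for i in range(len(words) - n + 1))


def prefixed_bag(words_by_dimension):
    """Merge per-dimension word lists into one bag, tagging each word with its dimension."""
    bag = Counter()
    for dimension, words in words_by_dimension.items():
        bag.update(f"{dimension}:{word}" for word in words)
    return bag


bigrams = word_ngrams(["abc", "abd", "abc", "abd"], n=2)
merged = prefixed_bag({"x": ["abc", "abc"], "y": ["abc"]})
```

Either kind of bag can be fed to the same TF*IDF weighting as plain words, since the weighting does not care what the tokens look like.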
Applications?
• I would be happy to try it on real data and a real problem.
• Right now I am looking at an application to DNA contig clustering -> next slides
• Discords, Motifs -> next slides
Some back-up slides
[Figure: Space Shuttle Marotta Valve Series; x-axis from 0 to 5000. Annotations: “Poppet pulled significantly out of the solenoid before energizing”; “The de-energizing phase is normal”.]
http://www.marotta.com/files/Brochures_WP_CS/marotta_spacebro_final_lr.pdf
Plot and Annotation by Eamonn Keogh
www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
instances2Discords [FINE|main|3:23:45] data size: 5000; max: 7.06; min: -3.1; mean: 1.0547679999999706; NaNs: 0
instances2Discords [FINE|main|3:23:45] window size: 128, alphabet size: 3, SAX Trie size: 4872
getDiscordsAlgorithm [FINE|main|3:23:45] starting motif finding routines
getDiscordsAlgorithm [FINE|main|3:23:50] new discord: bca, position 4297, distance 20.551788243362186, elapsed time: 0m 5s 20ms
getDiscordsAlgorithm [FINE|main|3:23:55] new discord: acc, position 4071, distance 10.507368842864514, elapsed time: 0m 5s 278ms
Jmotif implements this following the papers by Keogh & Lin,
with slight differences
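For orientation, a brute-force sketch of what a discord is: the subsequence whose nearest non-overlapping neighbor is farthest away. This is the quadratic baseline, not the Jmotif implementation, which according to the log above uses a SAX trie. The function name and the toy series are assumptions, and the distance is taken on raw values with no z-normalization.

```python
import math


def find_discord(series, window):
    """Return (position, distance) of the subsequence with the farthest nearest neighbor."""
    count = len(series) - window + 1
    best_pos, best_dist = -1, -1.0
    for i in range(count):
        nearest = float("inf")
        for j in range(count):
            if abs(i - j) < window:
                continue  # skip overlapping windows (trivial self matches)
            dist = math.sqrt(
                sum((series[i + k] - series[j + k]) ** 2 for k in range(window))
            )
            nearest = min(nearest, dist)
        if nearest != float("inf") and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist


# Toy data (hypothetical): a periodic series with one spike injected at index 21.
series = [0, 1, 0, -1] * 10
series[21] = 5
position, distance = find_discord(series, window=4)
```

Every window of the clean periodic part has an exact repeat elsewhere, so its nearest-neighbor distance is zero; only windows covering the spike stand out.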
Raw data
5 seconds later…
all discords found
(Intel Atom CPU)
TEK-17 a bit more difficult, probably 5.5 seconds …
(smaller window – larger trie structure)
Here it is annotated by Jmotif:
[Figure: Space Shuttle Marotta Valve Series; x-axis from 0 to 5000. Annotation: “Poppet pulled significantly out of the solenoid before energizing”.]
Plot and Annotation by Eamonn Keogh
www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
Zoom into discord; other energizing fragments
It is an outlier!
http://code.google.com/p/jmotif/wiki/Discords
Clustering DNA contigs/reads with TF*IDF
• There is also a successful way of converting DNA to time series:
http://www.cs.ucr.edu/~eamonn/UCRsuite.html
• There were multiple attempts to treat DNA strings with information
theory techniques.
• All those complexity things… You probably know the Kullback–Leibler divergence.
• kMer == Ngram
• And TF*IDF was applied too. And, in fact, it works. And it is fast.
• A set of 76,326 contigs clustered in 2 minutes (!)
• But I don’t know how good that clustering is. It seems to be OK, but
more work is needed.
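A minimal sketch of the kMer == Ngram idea: count overlapping k-mers per contig, weight them with TF*IDF across contigs, and compare contigs by cosine similarity. A clustering step would then run on these similarities. The function names, k = 3, the weighting variant, and the toy sequences are assumptions for illustration and say nothing about how the 76,326 contigs were actually processed.

```python
import math
from collections import Counter


def kmer_bag(sequence, k):
    """Count all overlapping k-mers of a DNA string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def tfidf(bags):
    """bags: list of Counters, one per contig -> list of {kmer: weight}."""
    n = len(bags)
    df = Counter()  # number of contigs containing each k-mer
    for bag in bags:
        df.update(bag.keys())
    return [
        {
            kmer: (1.0 + math.log(count)) * math.log(n / df[kmer])
            for kmer, count in bag.items()
            if df[kmer] < n  # k-mers present in every contig carry no weight
        }
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(kmer, 0.0) for kmer, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical contigs: the first two share most of their 3-mers, the third shares none.
contigs = ["ACGTACGTAC", "ACGTACGTTT", "GGGCCCGGGC"]
vectors = tfidf([kmer_bag(c, 3) for c in contigs])
sim_close = cosine(vectors[0], vectors[1])
sim_far = cosine(vectors[0], vectors[2])
```

Counting and weighting are linear in the total sequence length, which is consistent with the approach being fast; the pairwise comparison stage is where the cost would grow with the number of contigs.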
The tree of these 70K+ contigs seems to be partitioned well by TFIDF
Just a random subset
…
might be a problem?
Clustalw
By percentage of identity
Clustalw
By BLOSUM62
In fact,
…
Hard to tell…

 

SAX-VSM

  • 1. SAX-VSM: Interpretable time series classification with SAX, TF*IDF, and Vector Space Model. Pavel Senin, senin@hawaii.edu, University of Hawaii at Manoa, Department of Information and Computer Sciences, Collaborative Software Development Laboratory, http://csdl.ics.hawaii.edu
  • 2. Temporal data • Probably most of the collected data is temporal • Smarter technology (monitoring, on-line adjustment): – smarter house, power grid, water supply – smarter traffic – smarter cooking of your favorite food • Health, personal and global: – blood pressure, heartbeat, sugar level, weight – epidemiology • Safety and Sustainability: – fraud detection, unusual activity mining – civil infrastructure: bridges, buildings, roads – weather, seismography, astronomy – smarter agriculture • Economy: money, stocks, markets, shopping • Social networks, media, entertainment http://www.imdb.com/title/tt0192618/
  • 3. Problem definition • Given sequences of points, or a live stream of points: $x^1_1, x^1_2, x^1_3, \ldots, x^1_{k_1}$; …; $x^m_1, x^m_2, x^m_3, \ldots, x^m_{k_m}$ • Find: – patterns, outliers (motifs, discords) • Perform: – classification, clustering, forecasting • Gain domain-specific knowledge, infer a generative process. Real-life data: - not equidistant - compressed/stretched - congested - noise - lost points
  • 4. Similarity? Yes, you know it when you see it! But one needs to teach machines to see that difference too. It turns out to be quite a difficult task. All solutions are based on similarity in Time, Shape, or Change. How many metrics are out there? Pseudoquasimetrics, anyone? http://blog.sfgate.com/pets/2009/03/18/pet-look-alike-photo-contest/, 02-13-2013
  • 5. State of the art. “…Euclidean distance or Dynamic Time Warping (DTW) distance does not significantly outperform random guessing. The reason for the poor performance of these otherwise very competitive classifiers seems to be due to the fact that the data is somewhat noisy (i.e. insect bites, and different stem lengths)…” “Time Series Shapelets: A New Primitive for Data Mining”, L. Ye, E. Keogh. “…However, it is clear that one-nearest-neighbor with Dynamic Time Warping (DTW) distance is exceptionally difficult to beat. This approach has one weakness, however; it is computationally too demanding for many realtime applications…” “Fast Time Series Classification Using Numerosity Reduction”, Xi, Keogh, Shelton, Wei. By far the most ubiquitous distance measure for time series is the Euclidean distance. The 1-NN Euclidean classifier is fast and accurate. Everyone benchmarks against it because it is really hard to beat. “…our basic message: transforming the data is the simplest way of achieving improvement in problems where the discriminating features are based on similarity in change and similarity in shape...” “Transformation Based Ensembles for Time Series Classification”, Bagnall, A., Davis, L., Hills, J., Lines, J.
  • 6. Features: what are they? • Can we ignore the time? • Can we step aside from mean, variance, kurtosis, and skewness? • Can we transform the temporal data into some feature space? • Can we then actively learn from these features? • SAX-VSM does it all. It doesn’t ignore time, though: ordering is “loosely” kept. (It keeps ordering within a sliding window, but not globally.) “Experiencing SAX: a Novel Symbolic Representation of Time Series”, J. Lin, E. Keogh, L. Wei, S. Lonardi
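The sliding-window idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the jMotif implementation: the function names (`znorm`, `paa`, `sax_word`, `bag_of_words`) are hypothetical, the window length is assumed to be divisible by the number of PAA segments, and refinements such as numerosity reduction are left out. Each window is z-normalized, averaged piecewise (PAA), and mapped to letters using equiprobable Gaussian breakpoints; the words are then counted without regard to where they occurred, so ordering survives only inside a window.

```python
from bisect import bisect_right
from collections import Counter
from statistics import NormalDist, mean, pstdev


def znorm(xs, eps=1e-8):
    """Z-normalize a window; a (near-)constant window maps to all zeros."""
    mu, sd = mean(xs), pstdev(xs)
    if sd < eps:
        return [0.0] * len(xs)
    return [(x - mu) / sd for x in xs]


def paa(xs, segments):
    """Piecewise Aggregate Approximation.

    Assumption: len(xs) is divisible by `segments`.
    """
    size = len(xs) // segments
    return [mean(xs[i * size:(i + 1) * size]) for i in range(segments)]


def sax_word(window, segments, alphabet):
    """Turn one window into a SAX word such as 'abcd'."""
    # Breakpoints that cut the standard normal into equiprobable regions.
    cuts = [NormalDist().inv_cdf(i / alphabet) for i in range(1, alphabet)]
    return "".join(
        chr(ord("a") + bisect_right(cuts, v))
        for v in paa(znorm(window), segments)
    )


def bag_of_words(series, window, segments, alphabet):
    """Slide a window over the series and count the resulting SAX words."""
    bag = Counter()
    for start in range(len(series) - window + 1):
        bag[sax_word(series[start:start + window], segments, alphabet)] += 1
    return bag


ramp = list(range(12))
print(bag_of_words(ramp, window=8, segments=4, alphabet=4))
```

A steadily rising series yields the same word for every window, which is why the bag for `ramp` contains a single entry.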
  • 7. Features versus Distance • Dataset: electrical devices – kettle, immersion heater, fridge/freezer, oven/cooker, computer/television, and a dishwasher. – Measured every 15 minutes – means that ~15 minutes of information are lost. “Classification of Household Devices by Electricity Usage Profiles”, Lines et al. http://www.uea.ac.uk/cmp Jmotif take on ED dataset: https://code.google.com/p/jmotif/wiki/ElectricalDevices Error: Euclidean 1NN: 46%, DTW 1NN: 33%, Shapelet Tree: 45%, Shapelet SVM: 75%, SAX-VSM: 32%. (Figure labels: Distance, Features.)
  • 8. Implementation and reproducibility • There is sometimes a large difference in precision, • which aligns with the no free lunch theorems – http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization • My code is online here: https://github.com/jMotif/sax-vsm_classic, old location with more wiki and docs: http://code.google.com/p/jmotif/ • Feel free to use it for your needs. Please contribute your changes. It is GPL. • All is reproducible. Most of the data is available at the UCR homepage; other datasets are online too. • Due to active development, things might change a bit.
  • 9. Where I am coming from: yet another application problem - behaviors • I work on Software Processes for my thesis, specifically on recurrent behaviors – Given a live stream (or a Git log) of telemetry from hundreds of software project developers (the Linux kernel as an example), find: • What process do they perform, what is their goal? What are their habits? • Outliers? Clusters? I.e., roles and groups? – Given a dozen software project trails, are they similar or different in their software process? – What about the people who generate these artifacts? – I mean here NO periodicity, TONS of lost values, PLUS congestion, compression, you name it – the data is corrupted. • How I arrived at this method – I realized that the behaviors of every single individual must be counted in – they are all knitting together the software. So, when I looked at all the trails in SAX space, the choice of TF*IDF was obvious. • TF*IDF takes away similarities and highlights the behaviors which “stand out of the crowd”. Moreover, it weights their importance by counting their re-occurrence. So you will see that someone is changing little things here or there. • The Vector Space Model, in turn, takes care of carefully counting these “selected behaviors” in unknown temporal containers, pointing to the class they should be assigned to – with a score!
  • 10. Why behaviors? Yet another reason. “…The current standard method for validating a user’s identity for authentication on an information system requires humans to do something that is inherently unnatural: create, remember, and manage long, complex passwords. Moreover, as long as the session remains active, typical systems incorporate no mechanisms to verify that the user originally authenticated is the user still in control of the keyboard…” http://www.darpa.mil/Our_Work/I2O/Programs/Active_Authentication.aspx 02-13-2013 And many other things can be made “smart”.
  • 11. SAX-VSM classification at large: features. All this has been well known since 1972; I wasn’t born yet. Thank you, Gerard Salton! All this has been known since 2002; I wasn’t in grad school yet. Thank you, Jessica and Eamonn!
  • 12. The only formulas: cosine similarity and TF*IDF. All other stuff is counting, hashing, threading, and other Java magic. http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html#sec:querydocweighting Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/newslides.html
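The formulas themselves did not survive in this transcript, so here is a minimal Python sketch of the two ingredients. It assumes one common weighting variant from the cited Manning et al. book (log-scaled term frequency times log10 inverse document frequency); the exact scheme used in the slides may differ. The names `tfidf_vectors`, `cosine`, and `classify`, and the toy bags in the usage lines, are hypothetical. Each class is treated as one "document" whose terms are SAX words; an unlabeled series is assigned to the class whose weight vector has the highest cosine similarity with the series' raw word counts.

```python
import math
from collections import Counter


def tfidf_vectors(class_bags):
    """Build one TF*IDF weight vector per class.

    class_bags maps a class label to a Counter of word frequencies.
    Assumed weighting: (1 + log10(tf)) * log10(N / df).
    """
    n = len(class_bags)
    df = Counter()
    for bag in class_bags.values():
        df.update(set(bag))  # each class counts once per word
    return {
        label: {
            word: (1 + math.log10(count)) * math.log10(n / df[word])
            for word, count in bag.items()
            if count > 0
        }
        for label, bag in class_bags.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(bag, vectors):
    """Return the label whose TF*IDF vector is closest to the bag."""
    query = {w: float(c) for w, c in bag.items()}
    return max(vectors, key=lambda label: cosine(query, vectors[label]))


# Toy bags of SAX words for two classes (made-up counts).
bags = {
    "up": Counter({"abcd": 5, "aabb": 1}),
    "down": Counter({"dcba": 4, "aabb": 2}),
}
vectors = tfidf_vectors(bags)
print(classify(Counter({"abcd": 3, "aabb": 1}), vectors))
```

Note how the word `aabb`, which occurs in both classes, receives weight zero: the IDF term removes what the classes share and keeps what sets them apart.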
  • 13. IT IS NOT COMPLETELY NEW. Later, I found that the idea had been exercised quite a few times before, with some success... But I would argue that I came to it and built it all alone, and made it work just fine. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390719/
  • 14. This is that dataset: Synthetic Control
  • 15. UCR Synthetic Control: JMotif error 0.7%, on par with 1-NN DTW. Error rate surface: one parameter is fixed, the two others vary. Were they somewhere here?
  • 16. Clustering: UCR Synthetic Control. No labels = no problem. k-Means works too (if you know the K).
  • 17. Another toy example: just three classes, CBF
  • 18. CBF: classification accuracy is 99.9%. http://jmotif.googlecode.com/svn/trunk/RCode/cbf/par_surface.gif
  • 19. CBF: accuracy, runtime, features. Classifying 30K series (10K of each class). Slow TFIDF stat but slow growing! Euclidean 1NN. Fast classification. Rate of absorbing (learning) of new features: each class ~1/3 of the whole, as expected…
  • 20. CBF, SAX-VSM features importance distribution
  • 21. Another well-studied example: Gun/Point dataset. Slide created by Eamonn Keogh, eamonn@cs.ucr.edu
  • 22. Looking for the best discriminating subseries, Shapelet-style. Slide by Lexiang Ye and Eamonn Keogh, www.cs.ucr.edu/~lexiangy/shapelet.html Euclidean 91.3%, DTW 90.7%, Shapelet 93.3%, SAX-VSM 99.44%
  • 23. Looking for the best discriminating subseries, SAX-VSM-style
  • 24. Gun/Point, SAX-VSM features importance distribution
  • 25. Leaves classification http://oregonstate.edu/dept/ldplants/Plant%20ID-Leaves.htm Leaf attachment: The pattern by which leaves are attached to a stem or twig is also a useful characteristic in plant identification. There are two large groups, alternate and opposite patterns, and a third less common pattern, whorled. Leaf lobes: Leaves may be lobed or not lobed. A lobe may be defined as a curved or rounded projection. With leaves there is no clear distinction between shallow lobes and deep teeth. A main vein is often visible in a lobe; this may not occur in teeth. Leaf margin: Another important leaf characteristic for plant identification is the edge or margin of a leaf or leaflet. Leaves have either smooth edges, called entire, or small notches or “teeth” along the margin.
  • 26. Leaves classification with SAX-VSM: Euclidean 51.7%, DTW 59.1%, Shapelet ?, SAX-VSM 92.2%. Moreover, SAX-VSM highlights the same places as human experts do.
  • 27. Acer circinatum, Acer glabrum, Acer macrophyllum, Acer negundo, Quercus kelloggii, Quercus garryana
  • 28. Coffee spectrograms classification. (Figure labels: Caffeine, Chlorogenic acid.) Caffeine / chlorogenic acid content: Arabica 1.2% / 5.5-8.0%; Robusta 2.2% / 7.0-10.0%. http://code.google.com/p/jmotif/wiki/ArabicaRobusta http://www.coffeechemistry.com/index.php/General/Agriculture/differences-between-arabica-and-robusta-coffee.html Euclidean 75.0%, DTW 82.1%, Shapelet ?, SAX-VSM 100%
  • 29. SAX-VSM classification accuracy study http://code.google.com/p/jmotif/wiki/Precision
  • 30. It can be better • Shapelets can differ in length, and that works better. • What if words differ in length? – I can use different SAX parameters and strategies, aiming at picking up class specificities. – But yes, the search space grows too… – TF*IDF and VSM do not care about word length or the number of words! Yoga dataset + Jmotif + {set of SAX params}*2 = -5% of error! Figure from “Semi-Supervised Time Series Classification”, Li Wei & Eamonn Keogh. Male or Female??? Euclidean 83%, DTW 83.6%, Shapelet ?, SAX-VSM 82% -> 87% (still running though)
  • 31. Best SAX parameters search: sampling the whole space with DIRECT. Slice of space with Sliding Window=42. Down from 432 points to 42 = 10X speedup.
  • 32. Results • SAX-VSM is, statistically speaking, at the level of the current state of the art. • It is fast, if (“and”?) you can learn offline. • Parameters search is the show stopper for now. • But the best thing – it allows one to find unique temporal patterns or discriminating features and weight them by importance among hundreds of thousands of candidates. Open questions • Still, efficient parameters search. • Evaluation methodology. What is the proper way? • TF*IDF implementations: how to choose a good one?
  • 33. Where it goes… roadmap • Ngrams – http://books.google.com/ngrams/graph?content=Sherlock+Holmes%2CAlbert+Einstein%2CSputnik%2Ctime+series%2CANOVA%2Cfeature+extraction%2Ctfidf%2BTFIDF%2CGoogle%2CINRA%2CNgram&year_start=1900&year_end=2008&corpus=15&smoothing=4&share= • Sequitur – Extracting a grammar from the time series and using it as input to the Vector Space part. I hope it will solve the parameters search problem. • Prediction!!! – Grammars can tell what the next word is. Ngrams can help with that too. • Multidimensional series – all the dimensions into prefixed bags – Ngrams spanning through dimensions. Applications? • I would be happy to try it on real data and on a real problem. • Right now I am looking at an application to DNA contigs clustering -> next slides • Discords, Motifs -> next slides
  • 35. Space Shuttle Marotta Valve Series. (Figure annotations: “Poppet pulled significantly out of the solenoid before energizing”; “The De-Energizing phase is normal”.) http://www.marotta.com/files/Brochures_WP_CS/marotta_spacebro_final_lr.pdf Plot and annotation by Eamonn Keogh, www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
  • 36. Raw data. Jmotif implements this following the papers by Keogh & Lin, a bit differently. 5 seconds later… all discords found (Intel Atom CPU). Log:
      instances2Discords [FINE|main|3:23:45] data size: 5000; max: 7.06; min: -3.1; mean: 1.0547679999999706; NaNs: 0
      instances2Discords [FINE|main|3:23:45] window size: 128, alphabet size: 3, SAX Trie size: 4872
      getDiscordsAlgorithm [FINE|main|3:23:45] starting motif finding routines
      getDiscordsAlgorithm [FINE|main|3:23:50] new discord: bca, position 4297, distance 20.551788243362186, elapsed time: 0m 5s 20ms
      getDiscordsAlgorithm [FINE|main|3:23:55] new discord: acc, position 4071, distance 10.507368842864514, elapsed time: 0m 5s 278ms
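For readers unfamiliar with discords: a discord is the subsequence whose nearest non-overlapping neighbor is farthest away. The sketch below is a brute-force Python illustration of that definition only. It is not the Jmotif code, which, judging by the log above, uses a SAX trie to order the search. The function name `brute_force_discord` is hypothetical, plain Euclidean distance on raw values is assumed (no z-normalization), and the cost is quadratic in the series length.

```python
import math


def brute_force_discord(series, window):
    """Return (position, distance) of the top discord.

    For every window, find the distance to its nearest neighbor among
    windows that do not overlap it; the discord is the window for which
    that nearest-neighbor distance is largest.
    """
    n = len(series) - window + 1
    best_pos, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < window:
                continue  # skip self and overlapping (trivial) matches
            d = math.dist(series[i:i + window], series[j:j + window])
            nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist


# A periodic toy signal with one spike planted at index 17.
signal = [0, 1, 0, -1] * 8
signal[17] = 5
print(brute_force_discord(signal, window=4))
```

On the toy signal every ordinary window has an exact copy elsewhere, so only the windows covering the spike have a non-zero nearest-neighbor distance.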
  • 37. TEK-17: a bit more difficult, probably 5.5 seconds… (smaller window – larger trie structure). Here it is annotated by Jmotif. (Figure: Space Shuttle Marotta Valve Series; annotations: “Poppet pulled significantly out of the solenoid before energizing”, “Zoom into Discord”, “Other energizing fragments”.) Plot and annotation by Eamonn Keogh, www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt It is an outlier! http://code.google.com/p/jmotif/wiki/Discords
  • 38. Clustering DNA contigs/reads with TF*IDF • There is also a successful way of converting DNA to time series: http://www.cs.ucr.edu/~eamonn/UCRsuite.html • There were multiple attempts to treat DNA strings with information theory techniques. • All those complexity things… - Kullback–Leibler divergence you would probably know. • kMer == Ngram • And TF*IDF was applied too. And, in fact, it works. And it is fast. • A set of 76,326 contigs clustered in 2 minutes (!) • But I don’t know how good that clustering is. It seems to be OK, but more work is needed.
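The "kMer == Ngram" remark can be made concrete with a tiny Python sketch. The function name `kmer_bag` is hypothetical and the slides do not say which k was used: a DNA string is cut into overlapping substrings of length k, and the counts form a bag of words that can be weighted with TF*IDF just like SAX words.

```python
from collections import Counter


def kmer_bag(sequence, k):
    """Count all overlapping k-mers (character n-grams) of a DNA string."""
    sequence = sequence.upper()
    return Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )


print(kmer_bag("ACGTAC", 2))
```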
  • 39. The tree of these 70K+ contigs seems to be partitioned well by TFIDF
  • 40. Just a random subset… might be a problem? ClustalW, by percentage of identity