This document describes the SAX-VSM (Symbolic Aggregate approXimation – Vector Space Model) method for interpretable time series classification. SAX-VSM transforms time series into symbolic representations called "words", then applies TF*IDF (term frequency – inverse document frequency) weighting to select discriminative words and build one weight vector per class; an unlabeled series is assigned to the class whose vector scores the highest cosine similarity. The method achieves high accuracy on benchmark datasets such as Gun/Point and Coffee, outperforming Euclidean and DTW distance measures. Open questions remain around efficient parameter search and evaluation methodology.
1. SAX-VSM
Interpretable time series classification
with SAX, TF*IDF, and Vector Space Model
Pavel Senin
senin@hawaii.edu
University of Hawaii at Manoa
Department of Information and Computer Sciences
Collaborative Software Development Laboratory
http://csdl.ics.hawaii.edu
2. Temporal data
• Probably most of the collected data is temporal
• Smarter technology (monitoring, on-line adjustment):
– smarter house, power grid, water supply
– smarter traffic
– smarter cooking of your favorite food
• Health, personal and global
– blood pressure, heartbeat, sugar level, weight
– epidemiology
• Safety and Sustainability
– fraud detection, unusual activity mining
– civil infrastructure: bridges, buildings, roads
– weather, seismography, astronomy
– smarter agriculture
• Economy: money, stocks, markets, shopping
• Social networks, media, entertainment
http://www.imdb.com/title/tt0192618/
3. Problem definition
• Given a sequence of points, or a live stream of points
• Find:
– patterns, outliers, (motifs, discords)
• Perform:
– classification, clustering, forecasting
• Gain domain-specific knowledge, infer a generative process
$$x^{1}_{1},\; x^{1}_{2},\; x^{1}_{3},\; \dots,\; x^{1}_{k_1}$$
$$\vdots$$
$$x^{m}_{1},\; x^{m}_{2},\; x^{m}_{3},\; \dots,\; x^{m}_{k_m}$$
Real-life data:
- not equidistant
- compressed/stretched
- congested
- noise
- lost points
4. Similarity? Yes, you know when you see it!
But one needs to teach machines to see
that difference too.
It turns out to be quite a difficult task.
All solutions are based on the similarity
in Time, Shape, or Change.
How many metrics out there?
Pseudoquasimetrics anyone?
http://blog.sfgate.com/pets/2009/03/18/pet-look-alike-photo-contest/, 02-13-2013
5. ’’…Euclidean distance or Dynamic Time Warping (DTW) distance does not significantly
outperform random guessing. The reason for the poor performance of these otherwise very
competitive classifiers seems to be due to the fact that the data is somewhat noisy (i.e. insect
bites, and different stem lengths)…’’
“Time Series Shapelets: A New Primitive for Data Mining”, L. Ye, E. Keogh.
’’…However, it is clear that one-nearest-neighbor with Dynamic Time Warping
(DTW) distance is exceptionally difficult to beat. This approach has one weakness,
however; it is computationally too demanding for many realtime applications…’’
“Fast Time Series Classification Using Numerosity Reduction”, Xi, Keogh, Shelton, Wei
By far, the most ubiquitous distance measure for time series is the Euclidean
distance. The 1-NN Euclidean classifier is fast and accurate. Everyone benchmarks
against it because it is really hard to beat.
State of the art
“…our basic message: transforming the data is the simplest way of achieving
improvement in problems where the discriminating features are based on similarity in
change and similarity in shape...”
“Transformation Based Ensembles for Time Series Classification”, Bagnall, A., Davis, L., Hills, J., Lines, J.
6. • Can we ignore the time?
• Can we step aside of mean, variance, kurtosis and skewness?
• Can we transform the temporal data in some feature space?
• Can we then actively learn from these features?
• SAX-VSM does all of this; it does not ignore time entirely, though – ordering is "loosely" kept.
(It keeps ordering within a sliding window, but not globally.)
Features – what are they?
“Experiencing SAX: a Novel Symbolic Representation of Time Series", J.Lin, E.Keogh, L.Wei, S.Lonardi
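For the curious, here is a minimal sketch of the SAX step described in the Lin et al. paper above: z-normalize a (sub)sequence, reduce it with Piecewise Aggregate Approximation (PAA), then map each segment mean to a letter via the equiprobable Gaussian breakpoints. The breakpoints for alphabet sizes 3 and 4 are the standard published ones; the class and method names here are illustrative, not jMotif's actual API.

```java
import java.util.Arrays;

// A minimal SAX sketch (z-normalize -> PAA -> alphabet cut), following
// Lin et al.; names are illustrative, not the jMotif API. Assumes the
// input is at least as long as the requested PAA size.
public final class SaxSketch {

  // Equiprobable N(0,1) breakpoints for alphabet sizes 3 and 4.
  static double[] cuts(int alphabetSize) {
    switch (alphabetSize) {
      case 3: return new double[] {-0.43, 0.43};
      case 4: return new double[] {-0.67, 0.0, 0.67};
      default: throw new IllegalArgumentException("only sizes 3 and 4 shown here");
    }
  }

  // Z-normalize: zero mean, unit variance (flat segments map to zeros).
  static double[] znorm(double[] ts) {
    double mean = Arrays.stream(ts).average().orElse(0.0);
    double sd = Math.sqrt(
        Arrays.stream(ts).map(v -> (v - mean) * (v - mean)).sum() / ts.length);
    double[] out = new double[ts.length];
    for (int i = 0; i < ts.length; i++) out[i] = sd > 1e-9 ? (ts[i] - mean) / sd : 0.0;
    return out;
  }

  // Piecewise Aggregate Approximation: mean of each of w equal segments.
  static double[] paa(double[] ts, int w) {
    double[] out = new double[w];
    for (int i = 0; i < w; i++) {
      int from = i * ts.length / w, to = (i + 1) * ts.length / w;
      double s = 0.0;
      for (int j = from; j < to; j++) s += ts[j];
      out[i] = s / (to - from);
    }
    return out;
  }

  // Map each PAA value to a letter: k-th region between breakpoints -> k-th letter.
  static String toWord(double[] paa, int alphabetSize) {
    double[] c = cuts(alphabetSize);
    StringBuilder sb = new StringBuilder();
    for (double v : paa) {
      int k = 0;
      while (k < c.length && v > c[k]) k++;
      sb.append((char) ('a' + k));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    double[] ts = {1.0, 1.1, 1.2, 3.0, 3.1, 3.2, 0.2, 0.1, 0.0, 0.1, 0.2, 0.1};
    // Prints a 4-letter SAX word over a 3-letter alphabet, e.g. "bcaa".
    System.out.println(toWord(paa(znorm(ts), 4), 3));
  }
}
```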
7. Features versus Distance
• Dataset: electrical devices
– Kettle, Immersion heater, fridge/freezer, oven/cooker, computer/television,
and a dishwasher.
– Measured every 15 minutes, which means ~15 minutes of information are lost between readings
“Classification of Household Devices by Electricity Usage Profiles", Lines et al. http://www.uea.ac.uk/cmp
Jmotif take on ED dataset: https://code.google.com/p/jmotif/wiki/ElectricalDevices
Error:
Euclidean 1NN: 46%
DTW 1NN: 33%
Shapelet Tree: 45%
Shapelet SVM: 75%
SAX-VSM: 32%
(Figure panels: Distance vs. Features.)
8. Implementation and reproducibility
• Sometimes there is a large difference in precision between methods,
which aligns with the no-free-lunch theorems:
– http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization
• My code is online here: https://github.com/jMotif/sax-vsm_classic, old location with more wiki
and docs: http://code.google.com/p/jmotif/
• Feel free to use it for your needs. Please contribute your changes back. It is GPL.
• Everything is reproducible. Most of the data is available at the UCR homepage; the other datasets
are online too.
• Due to active development, things might change a bit.
9. Where I am coming from:
yet another application problem – behaviors
• I work on Software Processes for my thesis – specifically, on recurrent behaviors
– Given a live stream (or a Git log) of telemetry from hundreds of software project developers (the Linux
kernel as an example), find:
• What process they perform, what is their goal? What are their habits?
• Outliers? Clusters? i.e. roles and groups?
– Given a dozen software project trails, are they similar or different in their software process?
– What about the people who generate these artifacts? I mean here NO periodicity, TONS of lost values,
PLUS congestion, compression, you name it – the data is corrupted.
• How I arrived at this method: I realized that the behaviors of every single individual must
be counted in – they are all knitting the software together. So when I looked at all the trails in
SAX space, the choice of TF*IDF was obvious.
• TF*IDF takes away similarities and highlights the behaviors which "stand out of the crowd".
Moreover, it weights their importance by counting their re-occurrence. So you will see when
someone changes little things here or there.
• The Vector Space Model, in turn, takes care of carefully counting these "selected behaviors" in
unknown temporal containers, pointing to the class they should be assigned to – with a score!
10. Why behaviors? Yet another reason.
And many other things can be made "smart".
http://www.darpa.mil/Our_Work/I2O/Programs/Active_Authentication.aspx, 02-13-2013
“…The current standard method for validating a user’s identity for
authentication on an information system requires humans to do
something that is inherently unnatural: create, remember, and manage
long, complex passwords. Moreover, as long as the session remains
active, typical systems incorporate no mechanisms to verify that the
user originally authenticated is the user still in control of the
keyboard…”
11. SAX-VSM classification at large: features
All this has been well known since 1972 – I wasn't born yet. Thank you, Gerard Salton!
All this has been known since 2002 – I wasn't in grad school yet. Thank you, Jessica and Eamonn!
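To make "SAX-VSM classification at large" concrete, here is a minimal sketch of the whole loop, building on the SaxSketch above: slide a window over each labeled series, collect the SAX words into one bag per class (skipping consecutive duplicate words – numerosity reduction), weight each class bag with TF*IDF, and label an unknown series by cosine similarity between its term-frequency vector and each class weight vector. The log-tf weighting shown is one common scheme from the IR book cited on the next slide; the actual jMotif code differs in details.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A minimal SAX-VSM pipeline sketch: per-class bags of SAX words -> TF*IDF
// weighting -> cosine-similarity labeling. Builds on the SaxSketch helpers
// above; the (1 + log tf) * log(N/df) weighting is one common choice.
public final class SaxVsmSketch {

  // Slide a window over the series, emit one SAX word per position, and
  // skip consecutive duplicate words (numerosity reduction).
  static Map<String, Integer> bagOfWords(double[] ts, int window, int paaSize, int alphabet) {
    Map<String, Integer> bag = new HashMap<>();
    String prev = null;
    for (int i = 0; i + window <= ts.length; i++) {
      double[] sub = Arrays.copyOfRange(ts, i, i + window);
      String word = SaxSketch.toWord(SaxSketch.paa(SaxSketch.znorm(sub), paaSize), alphabet);
      if (!word.equals(prev)) bag.merge(word, 1, Integer::sum);
      prev = word;
    }
    return bag;
  }

  // TF*IDF over the class "documents". Words present in every class get
  // idf = 0 and drop out: this is the "takes away similarities" effect.
  static Map<String, double[]> tfIdf(List<Map<String, Integer>> classBags) {
    int n = classBags.size();
    Set<String> vocab = new HashSet<>();
    classBags.forEach(b -> vocab.addAll(b.keySet()));
    Map<String, double[]> weights = new HashMap<>();
    for (String w : vocab) {
      long df = classBags.stream().filter(b -> b.containsKey(w)).count();
      double idf = Math.log((double) n / df);
      double[] row = new double[n];
      for (int c = 0; c < n; c++) {
        Integer tf = classBags.get(c).get(w);
        row[c] = (tf == null) ? 0.0 : (1.0 + Math.log(tf)) * idf;
      }
      weights.put(w, row);
    }
    return weights;
  }

  // Label = class whose weight vector has the highest cosine similarity
  // with the query's raw term-frequency vector.
  static int classify(Map<String, Integer> queryBag, Map<String, double[]> weights, int numClasses) {
    double[] dot = new double[numClasses], norm = new double[numClasses];
    double qNorm = 0.0;
    for (Map.Entry<String, Integer> e : queryBag.entrySet()) {
      qNorm += (double) e.getValue() * e.getValue();
      double[] row = weights.get(e.getKey());
      if (row == null) continue; // word never seen in training
      for (int c = 0; c < numClasses; c++) dot[c] += e.getValue() * row[c];
    }
    for (double[] row : weights.values())
      for (int c = 0; c < numClasses; c++) norm[c] += row[c] * row[c];
    int best = 0;
    double bestSim = -1.0;
    for (int c = 0; c < numClasses; c++) {
      double sim = dot[c] / (Math.sqrt(qNorm) * Math.sqrt(norm[c]) + 1e-12);
      if (sim > bestSim) { bestSim = sim; best = c; }
    }
    return best;
  }
}
```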
12. The only formulas. All other stuff is counting, hashing,
threading, and other Java magic.
http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html#sec:querydocweighting
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
Cosine similarity
TF*IDF
http://nlp.stanford.edu/IR-book/newslides.html
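The two formulas were images in the original slide; in the standard form given in the cited IR book they are:

```latex
% TF*IDF weight of term t in (class) document d, over N documents:
\mathrm{tf\text{-}idf}_{t,d} \;=\; \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}

% Cosine similarity between weight vectors u and v:
\cos(\vec{u},\vec{v}) \;=\; \frac{\vec{u}\cdot\vec{v}}{\lVert\vec{u}\rVert\,\lVert\vec{v}\rVert}
 \;=\; \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
```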
13. IT IS NOT COMPLETELY NEW
Later I found that the idea had been exercised quite a
few times before – with some success...
But I would argue that I came to it and built it
all on my own, and made it work just fine.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390719/
15. UCR Synthetic Control
jMotif error 0.7%, on par with 1-NN DTW.
(Figure: error-rate surface with one parameter fixed and the other two varying. Were they somewhere here?)
19. CBF: accuracy, runtime, features.
Classifying 30K series (10K of each class).
(Figure annotations: "Slow TF*IDF stat, but slow growing!"; "Euclidean 1NN – fast classification"; "Rate of absorbing (learning) of new features – each class ~1/3 of the whole, as expected…")
22. Slide by Lexiang Ye and Eamonn Keogh, www.cs.ucr.edu/~lexiangy/shapelet.html
Looking for the best discriminating subsequence, shapelet-style.
Euclidean 91.3%, DTW 90.7%, Shapelet 93.3%, SAX-VSM 99.44%
25. Leaves classification
http://oregonstate.edu/dept/ldplants/Plant%20ID-Leaves.htm
Leaf attachment:
The pattern by which leaves are attached to a
stem or twig is also a useful characteristic in
plant identification. There are two large
groups, alternate and opposite patterns, and
a third less common pattern, whorled.
Leaf lobes:
Leaves may be lobed or not lobed. A lobe may
be defined as a curved or rounded
projection. With leaves there is no clear
distinction between shallow lobes and deep
teeth. A main vein is often visible in a lobe;
this may not occur in teeth.
Leaf margin:
Another important leaf characteristic for plant
identification is the edge or margin of a leaf or
leaflet. Leaves have either smooth edges,
called entire, or small notches or “teeth”
along the margin.
26. Leaves classification with SAX-VSM
Euclidean 51.7%, DTW 59.1%, Shapelet ?, SAX-VSM 92.2%
Moreover, SAX-VSM highlights the same places as human experts do.
30. • Shapelets can differ in their length, and that works better.
• What if words differed in their length too?
– I can use different SAX parameters and strategies, aiming at picking up class
specificities.
– But yes, the search space grows too…
It can be even better – TF*IDF and the VSM do not care about word length or count!
(A minimal merging sketch follows after this slide.)
Yoga dataset + jMotif + {set of SAX params}*2 = -5% of error!
Figure from "Semi-Supervised Time Series Classification". Li Wei & Eamonn Keogh
Male or Female???
Euclidean 83%, DTW 83.6%, Shapelet ?, SAX-VSM 82% -> 87% (still running though)
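Since the vector space does not care about word length or count, mixing several SAX parameter sets is just a union of bags. A minimal sketch, assuming the bagOfWords helper from the earlier pipeline sketch; prefixing each word with its parameter triple keeps words from different (window, PAA, alphabet) settings distinct:

```java
import java.util.HashMap;
import java.util.Map;

// Union of bags built with different SAX parameter sets; the prefix keeps
// words from different (window, paa, alphabet) triples apart. Relies on
// the SaxVsmSketch.bagOfWords sketch above being on the same classpath.
public final class MultiParamBag {
  static Map<String, Integer> build(double[] ts, int[][] params) {
    Map<String, Integer> merged = new HashMap<>();
    for (int[] p : params) { // p = {window, paaSize, alphabetSize}
      SaxVsmSketch.bagOfWords(ts, p[0], p[1], p[2]).forEach((word, count) ->
          merged.merge(p[0] + "/" + p[1] + "/" + p[2] + ":" + word, count, Integer::sum));
    }
    return merged;
  }
}
```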
31. Best SAX parameters search
Sampling the whole space with DIRECT
Slice of the space with sliding window = 42.
Down from 432 points to 42 = 10X speedup.
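For contrast with DIRECT sampling, here is the naive exhaustive search it improves on: score every (window, PAA size, alphabet size) triple with a cross-validation run and keep the best. The ErrorFn callback is a placeholder for, e.g., a leave-one-out run of the classifier sketched earlier; DIRECT reaches a comparable answer with far fewer evaluations.

```java
// Naive exhaustive search over SAX parameters; the deck replaces this with
// DIRECT sampling (e.g. cutting 432 grid points down to ~42 evaluations).
// ErrorFn is a placeholder hook, not part of any real jMotif API.
public final class ParamSearch {
  interface ErrorFn { double errorOf(int window, int paa, int alphabet); }

  static int[] bestParams(int[] windows, int[] paas, int[] alphas, ErrorFn cv) {
    int[] best = null;
    double bestErr = Double.MAX_VALUE;
    for (int w : windows)
      for (int p : paas)
        for (int a : alphas) {
          double err = cv.errorOf(w, p, a); // one full cross-validation run
          if (err < bestErr) { bestErr = err; best = new int[] {w, p, a}; }
        }
    return best; // {window, paaSize, alphabetSize} with the lowest CV error
  }
}
```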
32. Results
• SAX-VSM is, statistically speaking, at the level of the current state of
the art.
• It is fast, if ("and"?) you can learn offline.
• Parameter search is the showstopper for now.
• But the best thing – it allows one to find unique temporal
patterns or discriminating features and weight them by
importance among hundreds of thousands of candidates.
Open questions
• Still, efficient parameter search.
• Evaluation methodology: what is the proper way?
• TF*IDF implementations: how to choose a good one?
35. Space Shuttle Marotta Valve Series
(Figure annotations: "Poppet pulled significantly out of the solenoid before energizing"; "The de-energizing phase is normal".)
http://www.marotta.com/files/Brochures_WP_CS/marotta_spacebro_final_lr.pdf
Plot and Annotation by Eamonn Keogh
www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
36. instances2Discords [FINE|main|3:23:45] data size: 5000; max: 7.06; min: -3.1; mean: 1.0547679999999706; NaNs: 0
instances2Discords [FINE|main|3:23:45] window size: 128, alphabet size: 3, SAX Trie size: 4872
getDiscordsAlgorithm [FINE|main|3:23:45] starting motif finding routines
getDiscordsAlgorithm [FINE|main|3:23:50] new discord: bca, position 4297, distance 20.551788243362186, elapsed time: 0m 5s 20ms
getDiscordsAlgorithm [FINE|main|3:23:55] new discord: acc, position 4071, distance 10.507368842864514, elapsed time: 0m 5s 278ms
jMotif implements this following the papers by Keogh & Lin, with slight differences.
Raw data shown above; 5 seconds later, all discords are found (on an Intel Atom CPU).
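For context, the discords in the log above follow the Keogh & Lin definition: the subsequence whose distance to its nearest non-overlapping match is the largest. jMotif finds them quickly via a SAX trie (the HOT SAX idea); the sketch below is only the brute-force O(n²) definition, not jMotif's algorithm.

```java
// Brute-force time series discord (Keogh & Lin): the subsequence whose
// nearest non-self match is farthest away. A SAX trie (HOT SAX) reaches
// the same answer much faster; this is the definition only.
public final class DiscordSketch {

  // Euclidean distance between two length-w subsequences starting at i and j.
  static double dist(double[] ts, int i, int j, int w) {
    double s = 0.0;
    for (int k = 0; k < w; k++) { double d = ts[i + k] - ts[j + k]; s += d * d; }
    return Math.sqrt(s);
  }

  // Returns {position, nearest-neighbor distance} of the top discord.
  static double[] topDiscord(double[] ts, int w) {
    int n = ts.length - w + 1;
    int bestPos = -1;
    double bestDist = -1.0;
    for (int i = 0; i < n; i++) {
      double nn = Double.MAX_VALUE;
      for (int j = 0; j < n; j++) {
        if (Math.abs(i - j) < w) continue; // skip trivial (overlapping) matches
        nn = Math.min(nn, dist(ts, i, j, w));
      }
      if (nn < Double.MAX_VALUE && nn > bestDist) { bestDist = nn; bestPos = i; }
    }
    return new double[] {bestPos, bestDist};
  }
}
```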
37. TEK-17 is a bit more difficult – probably 5.5 seconds…
(A smaller window means a larger trie structure.)
Here it is, annotated by jMotif:
(Figure annotation: "Poppet pulled significantly out of the solenoid before energizing".)
Space Shuttle Marotta Valve Series
Plot and Annotation by Eamonn Keogh
www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
(Zoom into the discord vs. other energizing fragments.)
It is an outlier!
http://code.google.com/p/jmotif/wiki/Discords
38. Clustering DNA contigs/reads with TF*IDF
• There is also a successful way of converting DNA to time series:
http://www.cs.ucr.edu/~eamonn/UCRsuite.html
• There have been multiple attempts to treat DNA strings with information-theory
techniques – all those complexity measures; you probably know the
Kullback–Leibler divergence.
• kMer == Ngram
• And TF*IDF has been applied too. And, in fact, it works. And it is fast.
• A set of 76,326 contigs clustered in 2 minutes (!)
• But I don’t know how good that clustering is. Seems to be OK, but
more work is needed.
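Since kMer == Ngram, the DNA side reuses the same machinery: chop each contig into overlapping k-mers, count them into a bag, and hand the bags to the same TF*IDF / cosine code. A minimal sketch (the contig string and k are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Bag of k-mers from a DNA string; these bags plug into the same
// TF*IDF / cosine machinery as the SAX word bags above.
public final class KmerBag {
  static Map<String, Integer> build(String dna, int k) {
    Map<String, Integer> bag = new HashMap<>();
    for (int i = 0; i + k <= dna.length(); i++)
      bag.merge(dna.substring(i, i + k), 1, Integer::sum);
    return bag;
  }

  public static void main(String[] args) {
    // Counts: ACG=2, CGT=2, GTA=1, TAC=1, GTG=1, TGG=1 (map order varies).
    System.out.println(build("ACGTACGTGG", 3));
  }
}
```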
39. The tree of these 70K+ contigs seems to be partitioned well by TF*IDF
40. Just a random subset… might that be a problem?
(Figure: ClustalW tree, by percentage of identity.)