2. Curriculum Vitae
Basic Information
Birthday – Aug. 31, 1978
Education
Dept. of CSE, YZU (B.S. 2000)
Dept. of CS, NTUST (M.S. 2002)
Dept. of CSIE, NCTU (Ph.D. 2012)
Advisors: Prof. Suh-Yin Lee (李素瑛教授) and Prof. Wen-Chih Peng (彭文志教授)
Ph.D. Dissertation: A Study on Time Interval-based Sequential Pattern Mining
4. Why Data Mining?
Commercial Viewpoint
Lots of data is being collected
Web data, e-commerce
Purchases at department stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g., in Customer Relationship Management)
5. Why Data Mining?
Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on satellites
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining may help scientists
in classifying and analyzing data
in hypothesis formation
6. Data Mining
We are buried in data, but looking for knowledge
Data mining
Knowledge discovery in databases
Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases
8. Sequential Pattern Mining
Point-based sequential pattern mining
Customer analysis, network intrusion detection, finding tandem repeats in DNA sequences, …
Simple relations between time points
Three relations (before, equal, after)
[Example: point-based transactions of beer, milk, and diaper purchases; with min_sup = 2, 〈(ab)dc〉 is a frequent sequential pattern]
9. Interval Data Everywhere!!
Interval data
Data has a duration time
Clinical data, library data, appliance usage data
Applications
Diagnosis systems, recommendation systems, smart homes
[Figure: a database feeding a diagnosis system, a recommendation system, and a smart home]
13. Motivation
Representation
Allen's relations are binary relations
How to express the relations among more than 3 intervals?
Efficient algorithms
Ambiguity problem
Space usage
Mining temporal patterns *
Mining closed temporal patterns
Incrementally maintaining discovered temporal patterns and closed temporal patterns
Related applications
Social networks
Smart homes
14. Proposed Method
Coincidence representation
Segments intervals into disjoint slices
A nonambiguous and compact representation
Endpoint representation
Global information of a sequence
A nonambiguous and compact representation
TPMiner (Temporal Pattern Miner)
A pattern-growth approach
Without candidate generation and testing
Two components
RPrefixSpan
Pruning strategies
15. Coincidence Representation
Segment intervals into disjoint slices
Four kinds of event slice
Start slice (+), intermediate slice (*), finish slice (−), and intact slice (no mark)
Coincidence
Slices occurring simultaneously
Space usage (for a k-pattern)
Best: k, worst: 2k space
[Example: intervals A and B have the coincidence representation (A+)(A−B+)(B−); intervals C, D, and E have (C+)(C*D)(C−)(E)]
16. Incision Strategy
A data structure, endtime_list
Sort and merge
Trace the endtime_list one by one
[Example: intervals (A, 1, 4), (B, 2, 5), (C, 2, 8), (D, 3, 5), (E, 5, 7). Each endpoint is put into the endtime_list as a (symbol, time, type) entry with type s (start) or f (finish); the list is sorted by time, and entries with the same time and type are merged (B and C share start time 2, and B and D share finish time 5). Tracing the merged list one by one yields the coincidence representation: (A+) (B+C+) (A−D+) (B−D−) (E) (C−)]
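The sort-and-merge step above can be sketched in Python. The function name, the tuple layout, and the ordering rule are illustrative assumptions, not the authors' implementation:

```python
def build_endtime_list(intervals):
    """Sketch of the incision strategy's endtime_list: every interval
    (symbol, start, finish) contributes a start ('s') and a finish ('f')
    endpoint; entries are sorted by time (finishes before starts at equal
    times) and entries sharing the same time and type are merged."""
    endtime_list = [(t, kind, sym)
                    for sym, start, finish in intervals
                    for t, kind in ((start, 's'), (finish, 'f'))]
    endtime_list.sort()  # by time, then type ('f' < 's'), then symbol
    merged = []
    for t, kind, sym in endtime_list:
        if merged and merged[-1][0] == t and merged[-1][1] == kind:
            merged[-1] = (t, kind, merged[-1][2] + sym)  # merge symbols
        else:
            merged.append((t, kind, sym))
    return merged

# The slide's example intervals yield the merged entries
# A/BC/D/A/BD/E/E/C at times 1, 2, 3, 4, 5, 5, 7, 8:
# build_endtime_list([('A', 1, 4), ('B', 2, 5), ('C', 2, 8),
#                     ('D', 3, 5), ('E', 5, 7)])
```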
17. Endpoint Representation
A sequence of ordered time points of events
+: start time, −: finish time
Nonambiguous
Space usage (for a k-pattern)
2k space
[Example: intervals A, B, C, and D are expressed as A+ (B+C+) A− (B−C−D+) D−]
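A minimal sketch of the conversion to the endpoint representation; the interval values below are hypothetical, chosen to reproduce the slide's example:

```python
def endpoint_representation(intervals):
    """Express event intervals as a sequence of ordered endpoints:
    'X+' at X's start time, 'X-' at its finish time; endpoints sharing
    the same time point are grouped in parentheses."""
    points = sorted((t, sym + mark)
                    for sym, start, finish in intervals
                    for t, mark in ((start, '+'), (finish, '-')))
    groups = []
    for t, label in points:
        if groups and groups[-1][0] == t:
            groups[-1][1].append(label)  # same time point -> same group
        else:
            groups.append((t, [label]))
    return ' '.join(labels[0] if len(labels) == 1
                    else '(' + ' '.join(labels) + ')'
                    for _, labels in groups)

# With A=[1,3], B=[2,4], C=[2,4], D=[4,5] this produces the slide's
# pattern: endpoint_representation([('A', 1, 3), ('B', 2, 4),
#                                   ('C', 2, 4), ('D', 4, 5)])
```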
19. TPMiner – RPrefixSpan (1/2)
Every item is disjoint
The relations among slices are simple
Before, equal, and after (like time-point data)
RPrefixSpan
Borrows the idea of PrefixSpan
Scan the local database to find frequent slices
Append and extend the pattern
Project the database
Pruning strategies
Reduce the search space
Pre-pruning and post-pruning
20. TPMiner – RPrefixSpan (2/2)
[Flow diagram: scan the database D to find the frequent items e1, e2, …, ei, …, en; transform the sequences and project the database into D|e1, …, D|ei, …, D|en; recursively project each database and append & extend the pattern; finally, collect all mined patterns to obtain the frequent temporal patterns]
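The recursive projection flow above can be sketched with a minimal PrefixSpan-style miner over flat item sequences. This is a hypothetical simplification: itemset handling and the slice semantics are omitted, and all names are illustrative.

```python
from collections import Counter

def prefix_span(db, min_sup, prefix=()):
    """Minimal PrefixSpan-style pattern growth: scan the (projected)
    database for locally frequent items, append each to the prefix,
    and recurse on the corresponding projected database -- no candidate
    generation and test."""
    patterns = []
    counts = Counter()
    for seq in db:
        counts.update(set(seq))  # count each item once per sequence
    for item in sorted(counts):
        sup = counts[item]
        if sup < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns.append((new_prefix, sup))
        # project: keep the suffix after the first occurrence of item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.extend(prefix_span(projected, min_sup, new_prefix))
    return patterns
```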
21. Pruning Strategy – Pre-pruning
Utilizes the concept of slices and coincidences
Start slices and finish slices occur in pairs
Only the frequent finish slices that have corresponding start slices in their prefixes require projection
[Example: after scanning the database for the prefix 〈A+〉, the frequent local slices are A−, B+, B−, and C. B− has no corresponding B+ in the prefix, so the projection D|〈A+ B−〉 is a non-promising projection and can be pre-pruned; the projections D|〈A+ A−〉, D|〈A+ B+〉, and D|〈A+ C〉 remain]
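The pre-pruning test can be sketched as a small filter. The string encoding of slices ('A+', 'B-', 'C') and the function name are assumptions made for illustration:

```python
def promising_extensions(prefix, frequent_slices):
    """Pre-pruning sketch: a finish slice ('X-') is a promising extension
    only if a matching start slice ('X+') is still open in the prefix;
    start slices and intact slices always pass."""
    open_events = set()
    for s in prefix:
        if s.endswith('+'):
            open_events.add(s[:-1])
        elif s.endswith('-'):
            open_events.discard(s[:-1])  # the pair is closed again
    kept = []
    for s in frequent_slices:
        if s.endswith('-') and s[:-1] not in open_events:
            continue  # non-promising projection: pre-pruned
        kept.append(s)
    return kept
```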
22. Pruning Strategy – Post-pruning
Utilizes the concept of slices and coincidences
A start slice always appears before its finish slice
Only collect the significant postfixes
With respect to a prefix α, all finish slices in a postfix must have corresponding start slices in α
[Example: given a coincidence database D with S1: 〈(B+)(D+)(E)(D−)(B−)〉, S2: 〈(B+)(B−D+)(E)(D−)〉, and S3: 〈(B)(A)(D+)(E)(D−)〉, the projected database D|〈E〉 contains only insignificant sequences (S1: 〈(D−)(B−)〉, S2: 〈(D−)〉, S3: 〈(D−)〉), so it can be post-pruned]
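The significance check behind post-pruning can be sketched as follows; coincidences are modeled as lists of slice strings such as 'B+', 'D-', or 'E', which is an assumed encoding rather than the authors' data structure:

```python
def significant_postfix(alpha, postfix):
    """Post-pruning sketch: a postfix is significant with respect to the
    prefix alpha only if every finish slice in it has a corresponding
    start slice somewhere in alpha."""
    started = {s[:-1] for coin in alpha for s in coin if s.endswith('+')}
    return all(s[:-1] in started
               for coin in postfix for s in coin if s.endswith('-'))
```

For the slide's example, every postfix in the projection with prefix 〈(E)〉 contains (D−) without any D+ in the prefix, so none is significant and the whole projected database can be skipped.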
23. Experimental Results (1/2)
[Figures, over minimum support 1%–0.5%: on dataset D200k–C40–N10k, (a) the execution time (sec) of the six algorithms (H-DFS, ARMADA, TPrefixSpan, IEMiner, TPMiner-CR, and TPMiner-ER) and (b) the number of generated temporal patterns; a further plot reports the memory usage (MB) on dataset N10k–C20–N10k]
24. Experimental Results (2/2)
[Figures: execution time (sec) over minimum support 1%–0.5%, comparing TPMiner-CR against (a) TPMiner-CR without the pre-pruning strategy, (b) TPMiner-CR without the post-pruning strategy, (c) TPMiner-CR without the subset-pruning strategy, and (d) TPMiner-CR without any pruning strategy]
26. Smart Home Application
[System diagram: home appliances (lights, air conditioner) report to a home server and a cloud database through a D-Link controller. (1) Sensor data logs record the on/off intervals of each appliance ID; (2) pattern mining extracts usage patterns P1, P2, P3, …; (3) behavior detection compares the current behavior against the usage patterns; (4) abnormal detection triggers (5) a system alarm and remote control]
27. Dynamic Social Network (1/2)
Dynamic social network
A sequence of interaction graphs
Nodes and edges vary with time
A lossless transformation
Graph sequence → interval sequence
[Example: graphs G1–G4 at times t1, t2, t3, … over the nodes A–E are transformed into an event sequence; each table row records an event symbol with its SID, start time, and finish time, e.g., A: 1–3, B: 1–3, C: 1–3, D: 2–4, E: 4–6]
28. Dynamic Social Network (2/2)
Reduces the complexity of graphs
Avoids isomorphism testing
Dynamic Social Network Analysis
Pattern mining
Classification
Recommender systems
Network sampling
Clustering
32. Advertisement Budget
According to eMarketer, advertisement spending on worldwide social networking sites:
2008: $23.3 million
2010: $23.6 billion
2011: almost $25.5 billion
[Figure: advertisement spending]
33. Influence Maximization
Word-of-mouth effect in social networks
Influence maximization problem
Select the initial users (seeds) so that the number of users that adopt the product or innovation is maximized
[Figure: selecting seeds in a social network]
34. Motivation
Characteristic of social networks
Community structure
[Figure: a 12-node network partitioned into communities]
Community and degree heuristic (CDH)
Utilize community information
Avoid influence overlapping
36. CDH – Adjust Step
Adjust the selected fundamental nodes
Seeds selected from a large community may activate more inactive nodes than those from a small community
Replace the fundamental node in a small community if we can activate more inactive nodes
Finally, output the result as the selected seed nodes
[Figure: among communities C1, C2, C3, …, Ck, the largest-degree node in Ck is deleted and replaced by the second-largest-degree node in C1]
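The select-then-adjust idea can be sketched as follows. This is a hypothetical simplification: node degree stands in for "can activate more inactive nodes", and the function name and data layout are assumptions, not the CDH algorithm's actual scoring.

```python
def cdh_seeds(communities, degree, k):
    """CDH-style sketch: pick a fundamental node (the highest-degree
    node) from each of the k largest communities, then try replacing the
    pick from the smallest chosen community with the runner-up of a
    larger community when the runner-up looks more promising."""
    communities = sorted(communities, key=len, reverse=True)
    seeds = [max(c, key=lambda n: degree[n]) for c in communities[:k]]
    for big in communities[:max(k - 1, 0)]:
        runners = sorted(big, key=lambda n: degree[n], reverse=True)
        if len(runners) > 1 and degree[runners[1]] > degree[seeds[-1]]:
            seeds[-1] = runners[1]  # replace the weakest fundamental node
            break
    return seeds
```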
39. Recommendation System
Predict the ratings or preferences of users
Using a model built from user and item characteristics
[Screenshots: (a) amazon.com, (b) youtube.com]
40. Collaborative Filtering (CF)
1. Calculate the similarity between the active user and the other users
Pearson's correlation, cosine similarity, conditional probability, etc.
2. Predict the ratings of items that have not been rated by the active user
3. Output the top-k items by the predicted results
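The three steps can be sketched for user-based CF with Pearson's correlation. Ratings are dicts from item to rating; all names are illustrative rather than a specific system's API:

```python
import math

def pearson(ra, rb):
    """Pearson's correlation over the items co-rated by two users."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Predict the active user's rating of an unrated item as the active
    user's mean rating plus the similarity-weighted, mean-centered
    ratings of the other users who rated the item."""
    mean_a = sum(active.values()) / len(active)
    num = den = 0.0
    for r in others:
        if item in r:
            s = pearson(active, r)
            mean_r = sum(r.values()) / len(r)
            num += s * (r[item] - mean_r)
            den += abs(s)
    return mean_a + num / den if den else mean_a
```

Ranking the unrated items by their predicted value and keeping the top k completes step 3.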
42. Motivation
Dynamic! Dynamic! Dynamic!
Why do we need dynamics?
All things vary with time
Dynamic Collaborative Filtering
Considers the influence of time in the calculation
Without considering time, the prediction results might be out of date
44. Advanced DSCF
α (the similarity decay value, SDV) might not be consistent over all time
Each user might have his/her own SDV at different time points
Feed back the predicted values against the actual values
45. Advanced DSCF
[Framework: the active user's ratings are predicted, the top items are recommended, and the actual values A_{a,i} are fed back. Both the prediction and the feedback estimate take the form

p_{a,i} = \bar{r}_a + \frac{\sum_{j=1}^{k} (r_{j,i} - \bar{r}_j)\,[\alpha \cdot sim'_{a,j} + (1-\alpha)\, sim''_{a,j}]}{\sum_{j=1}^{k} [\alpha \cdot sim'_{a,j} + (1-\alpha)\, sim''_{a,j}]}

where sim'_{a,j} and sim''_{a,j} are the two similarity measures blended by the decay value α]
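The blended-similarity prediction above can be sketched numerically. The dict keys (`sim1`, `sim2`, standing in for the slide's sim′ and sim″) and the function shape are assumptions for illustration:

```python
def dscf_predict(mean_a, neighbors, alpha):
    """DSCF-style prediction sketch: each neighbor j contributes its
    mean-centered rating of the item, weighted by the blend
    alpha * sim1 + (1 - alpha) * sim2 of two similarity values."""
    num = den = 0.0
    for j in neighbors:
        w = alpha * j['sim1'] + (1 - alpha) * j['sim2']
        num += w * (j['rating'] - j['mean'])
        den += w
    return mean_a + num / den if den else mean_a
```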
Here are Allen's 13 temporal relations.
For two intervals, we can describe their relationship by their start times and finish times.
For example, interval A meets interval B if A's finish time is equal to B's start time.
From our observation, there are three important issues in mining closed temporal patterns.
Complex relationships: since an interval has a duration, the relation between any two intervals is complicated.
Representation: all of Allen's temporal relations are binary relations, which means they can easily describe the relation between any two intervals. However, when we want to express the relations among more than 3 intervals, a problem arises: how can we express a pattern unambiguously, and what is the space usage needed to describe a pattern correctly?
The last issue is how to design an efficient algorithm: can we avoid candidate generation, and how can we reduce the number of database scans?
In this paper, we propose a nonambiguous and compact representation to express a temporal pattern, and also an efficient method that simplifies the processing of complex relations and avoids candidate generation.
We propose a new representation named the coincidence representation.
We use the global information of a sequence to segment intervals into disjoint slices.
It is a nonambiguous and compact representation.
Based on the coincidence representation, we propose the CTMiner algorithm.
It is a pattern-growth approach and does not need candidate generation and testing.
CTMiner can be decomposed into two components: the incision strategy and CPrefixSpan.
The incision strategy transforms sequences into the coincidence representation; CPrefixSpan mines all frequent temporal patterns.
The coincidence representation segments intervals into disjoint slices according to the arrangement of all endpoint times in the sequence.
There are four kinds of event slices:
start slices, such as the start slices of A, B, and C;
intermediate slices, such as the intermediate slice of C;
finish slices, such as the finish slices of A, B, and C;
and intact slices, which are not cut at all, such as the intact slices of D and E.
A coincidence is a group of event slices that occur simultaneously.
Concatenating all coincidences forms the coincidence representation of a sequence.
CTMiner has two main components: the incision strategy and CPrefixSpan.
The incision strategy transforms a sequence into the coincidence representation.
We first put all endpoint times of all intervals into the endtime_list, then sort them by time in increasing order, and merge two symbols if their time and type are the same, like B's start time and C's start time.
Then tracing the endtime_list one by one transforms the sequence into the coincidence representation, as in this example.
Temporal representation.
It uses a sequence of ordered time points to express a temporal pattern: plus represents a start time and minus represents a finish time.
In this example, the pattern ABCD is expressed as this sequence.
The start time of A is smaller than the start time of B, so here A is smaller than B; the start time of B is equal to the start time of C, so here B equals C.
The temporal representation has no ambiguity problem.
It uses 4k − 1 space to describe a k-pattern (2k for the event indices, and 2k − 1 for the relation describers).
TPrefixSpan adopts this representation.
Mining closed patterns actually requires a complicated process: it usually needs a lot of closure checking, that is, checking whether the mined pattern is a sub-pattern of an existing pattern, or whether a previously mined pattern is a sub-pattern of the current one.
We propose an endpoint representation, which can be transformed quickly from interval data.
Based on this simple representation, we can do the closure checking easily.
It is also a nonambiguous and compact representation, so we can transform the example database into an endpoint database.
Every coincidence in a coincidence sequence is disjoint, so the relations among slices are simple: just before, equal, and after.
We borrow the idea of PrefixSpan, an efficient algorithm for time-point data.
CPrefixSpan has three steps: scan the local database to find frequent slices, append and extend the pattern, and project the database.
By utilizing the concept of slices and coincidences, we propose two pruning strategies, pre-pruning and post-pruning, to reduce the search space.
This is the overview of CPrefixSpan.
We first scan the database to get all frequent intervals, then transform the sequences and project the database by these frequent intervals.
Then we project the database and append & extend the pattern recursively.
Finally, collecting all mining results gives all frequent temporal patterns.
The pre-pruning strategy is based on the concept of slices and coincidences.
Since start slices and finish slices always occur in pairs, we only need to project the frequent finish slices that have corresponding start slices in their prefixes.
In this example, for the projected database of A+, suppose that after scanning the database we have four frequent slices: A−, B+, B−, and C.
We then append and extend the pattern and build four projected databases.
However, B− does not have a corresponding B+ in the prefix pattern.
So by the pre-pruning strategy, we can avoid a lot of non-promising projections.
The post-pruning strategy is based on the concept of slices and coincidences.
Since start slices and finish slices always occur in pairs, and a start slice always appears before its finish slice, when building a projected database we only collect the significant postfixes.
A significant postfix means that, with respect to a prefix α, all finish slices in the postfix have corresponding start slices in α.
For example, given a coincidence database, when we build the projected database with E, all three sequences are insignificant, so the projected database with E can be post-pruned.
Based on the endpoint representation, we propose the CEMiner algorithm.
It is a pattern-growth approach and does not need candidate generation and testing.
CEMiner can be decomposed into two components: closure checking and a pruning strategy.
Sequential pattern mining is an important and basic research topic in the data mining domain because it is useful in many applications, such as customer analysis, network intrusion detection, and finding tandem repeats in DNA sequences, to name a few.
In this example, we can find behaviors and common patterns from customers' buying records, like: buy milk, then buy beer and diapers together.
However, in some real-world applications, an event is usually a time interval instead of a time point; it usually has a duration.
Examples include library reader analysis, patient disease analysis, and stock fluctuations, to name a few.
In this example, we can find the correlations between diseases from patients' records.
The relation between time points is simple: just before, equal, and after.
But the relation between time intervals is quite different; it is more complicated.
We usually use Allen's 13 temporal relations to describe the relationship between any two time intervals.
According to eMarketer, advertisement spending on social networking grows rapidly every year.
Looking at this figure: in 2008, the total expense was about $23.3 billion; in 2010, it was about $23.6 billion; but in 2011, it almost reached $26 billion.
So, obviously, research and the related technology on social network marketing are very important.
This is the framework of the CDH algorithm.
Given a social network represented as a graph, we first detect communities.
Then we use the community information to construct the potential pool and find the fundamental nodes from the pool.
Finally, we adjust the fundamental nodes to select the seed nodes.