6. A sequence: <(ef)(ab)(df)cb>
A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
An element may contain a set of items; items within an element are unordered, and we list them alphabetically.
<a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
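The subsequence containment used above can be sketched in Python. Here a sequence is a list of elements (sets of items), and a greedy earliest-match scan decides containment; the function name and representation are illustrative, not from the slides:

```python
def is_subsequence(sub, seq):
    """Check whether `sub` is a subsequence of `seq`.

    Each sequence is a list of elements, and each element is a set of items.
    `sub` is contained in `seq` if its elements can be matched, in order,
    to distinct elements of `seq` that are supersets of them. Matching each
    element to the earliest possible position is sufficient.
    """
    i = 0  # current scan position in seq
    for elem in sub:
        # advance until an element of seq contains this element of sub
        while i < len(seq) and not elem <= seq[i]:
            i += 1
        if i == len(seq):
            return False
        i += 1  # each element of seq may be matched at most once
    return True

# <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>, as on the slide
sub = [{'a'}, {'b', 'c'}, {'d'}, {'f'}]
seq = [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]
print(is_subsequence(sub, seq))  # True
```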
7. CHALLENGES IN SEQUENTIAL PATTERN MINING
A huge number of possible sequential patterns is hidden in databases.
A mining algorithm should:
find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
11. The Apriori Algorithm [Pseudo-Code]
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
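The pseudocode above can be sketched in Python. This is a minimal illustration, not an optimized implementation: candidates are generated by self-joining Lk and pruned with the Apriori property (every k-subset of a candidate must be frequent), and min_support is an absolute count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori sketch: transactions are sets of items,
    min_support is an absolute support count."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 1
    while Lk:
        # candidate generation: join Lk with itself, then prune any
        # candidate with an infrequent k-subset (Apriori property)
        Ck = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk
                                           for s in combinations(u, k)):
                    Ck.add(u)
        # one database scan: count candidates contained in each transaction
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```

Note the one-scan-per-level structure: each iteration of the while loop reads the whole database once, which is exactly why Apriori needs up to m scans for a longest frequent itemset of size m.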
12. APRIORI ADV/DISADV
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes the transaction database is memory resident.
Requires up to m database scans, where m is the size of the largest frequent itemset.
13. FP-GROWTH (J. Han, J. Pei, and Y. Yin, 2000)
Depth-first search
Avoid explicit candidate generation
Adopt divide-and-conquer strategy
Two-step approach:
Step 1: Build a compact data structure called the FP-tree.
Step 2: Extract frequent itemsets from the FP-tree.
14. Step 1: FP-Tree Construction
FP-Tree is constructed using 2 passes over the data-set:
Pass 1:
Scan data and find support for each item.
Discard infrequent items.
Sort frequent items in decreasing order based on
their support.
15. Pass 2:
Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix).
– In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines).
– The more paths overlap, the higher the compression; the FP-tree may fit in memory.
4. Frequent itemsets are then extracted from the FP-tree.
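The two passes above can be sketched as follows. This is a minimal sketch: the class and function names are illustrative, and the header table is a plain dict of node lists standing in for the dotted-line linked lists:

```python
from collections import defaultdict

class FPNode:
    """A node of the FP-tree: an item, a counter, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count supports, discard infrequent items,
    # and fix a decreasing-support order (ties broken alphabetically)
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: insert each transaction as a path; overlapping
    # prefixes share nodes, so only counters are incremented
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> its nodes (stand-in for linked lists)
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header
```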
16. Start from each frequent length-1 pattern (as an initial suffix pattern).
Construct its conditional pattern base (a "sub-database" consisting of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
Construct its (conditional) FP-tree, and perform mining recursively on this tree.
Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
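The recursive pattern-growth step can be sketched with a simplified variant: instead of an actual conditional FP-tree, each conditional database is represented directly as a list of (prefix path, count) pairs, which keeps the sketch short while preserving the suffix-growth logic. Names and representation are illustrative:

```python
from collections import defaultdict

def fp_growth_paths(paths, min_support, suffix=frozenset()):
    """Simplified pattern growth over counted prefix paths.

    paths: list of (item_list, count) pairs, items listed in the
    tree's fixed order. For each frequent item, grow the suffix
    pattern and recurse on that item's conditional pattern base.
    """
    # count item supports within this conditional database
    support = defaultdict(int)
    for path, count in paths:
        for item in path:
            support[item] += count

    patterns = {}
    for item, s in support.items():
        if s < min_support:
            continue
        pattern = suffix | {item}       # grow the suffix pattern
        patterns[pattern] = s
        # conditional pattern base: the prefixes preceding `item`
        cond = []
        for path, count in paths:
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    cond.append((prefix, count))
        patterns.update(fp_growth_paths(cond, min_support, pattern))
    return patterns

# three transactions in fixed order b, a, c (each with count 1)
db = [(['b', 'a'], 1), (['b', 'c'], 1), (['b', 'a', 'c'], 1)]
print(fp_growth_paths(db, 2))
```

Because the items in each path follow the tree's fixed order, every frequent itemset is generated exactly once, mirroring how a real conditional FP-tree avoids duplicate patterns.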
17. [Tables: the transactional data, and the item-support table after the first scan of the database]
21. FP-GROWTH ADV/DISADV
Advantages of FP-Growth:
• only 2 passes over the data-set
• "compresses" the data-set
• no candidate generation
• much faster than Apriori
Disadvantages of FP-Growth:
• FP-tree may not fit in memory!
• FP-tree is expensive to build
22. APPLICATIONS
Customer shopping sequences:
e.g., first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science
& eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures