Finite state automata and transducers made it into Lucene fairly recently, but they already show a very promising impact on search performance. These data structures are rarely exploited because they are commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, that their construction can be very efficient (in both memory and time), and that their field of applications is very broad.
2. Dawid Weiss
20+ years of coding
10 years assembly only
Academia & Research
PhD in Information Retrieval, PUT
Open source
Carrot2, HPPC, Lucene, …
Industry & Business
Carrot Search s.c.
3. Talk outline
State machines (automata)
FSAs, DFAs, FSTs and other XXXs.
Use cases in Lucene and Solr
Suggester. FuzzySearch. Index.
No API details
Still @experimental.
6. HashSet
hash → slot → value
0x29384d34 → lucene
0xde3e3354 → lucid
0x00000666 → lucifer
FSA (deterministic)
[Figure: deterministic automaton accepting lucene, lucid and lucifer]
7. FSA (deterministic)
exists(sequence)
floor(prefix)
ceil(prefix)
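The operations listed for the automaton can be sketched on a plain trie, which is the simplest deterministic automaton (minimization would merge equivalent states but would not change the answers). The names `Node`, `build_trie`, `exists` and `ceil` are illustrative only and are not Lucene's API; `ceil` is assumed here to mean "the smallest accepted sequence greater than or equal to the argument", and `floor` would be the mirror image.

```python
class Node:
    """One automaton state: outgoing arcs plus an 'accepting' flag."""

    def __init__(self):
        self.arcs = {}      # label -> Node
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.arcs.setdefault(ch, Node())
        node.final = True
    return root


def exists(root, sequence):
    """True if the automaton accepts exactly this sequence."""
    node = root
    for ch in sequence:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final


def _smallest(node, prefix):
    # Follow the smallest label until an accepting state is reached.
    while not node.final:
        if not node.arcs:
            return None     # only possible for an empty automaton
        ch = min(node.arcs)
        prefix += ch
        node = node.arcs[ch]
    return prefix


def ceil(root, key):
    """Smallest accepted sequence >= key, or None if there is none."""
    path = [root]
    for ch in key:
        nxt = path[-1].arcs.get(ch)
        if nxt is None:
            break
        path.append(nxt)
    matched = len(path) - 1
    if matched == len(key):
        # The whole key was consumed: the answer lies at or below this state.
        return _smallest(path[-1], key)
    # Otherwise back up until some state has an arc with a larger label.
    for depth in range(matched, -1, -1):
        larger = [c for c in path[depth].arcs if c > key[depth]]
        if larger:
            c = min(larger)
            return _smallest(path[depth].arcs[c], key[:depth] + c)
    return None


root = build_trie(["lucene", "lucid", "lucifer"])
```

Unlike a hash lookup, the traversal keeps the order of the keys, which is what makes `ceil("lucie")` answerable at all.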
17. a?ⁿaⁿ
n=3 → a?a?a?aaa
Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
18. [Plot: matching time in ms (y axis, 0 to 35000) against n (x axis, 0 to 30)]
Time of matching aⁿ against the pattern a?ⁿaⁿ, depending on n. Java 1.6, modern hardware.
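For contrast with a backtracking matcher, here is a minimal sketch of an automaton-style simulation of the same pattern. It is written for this one pattern only: states 0 to 2n count how many pattern positions have been passed, and the first n positions are optional. The function names are made up for illustration. Python's `re` module is also a backtracking engine, so it is used below only as a reference on small n.

```python
import re


def pattern(n):
    """The pattern from the slide: n optional a's followed by n mandatory a's."""
    return "a?" * n + "a" * n


def nfa_match(n, text):
    """Simulate the automaton for a?^n a^n by tracking the set of live states."""

    def closure(states):
        # Any state below n may skip its optional 'a' and move one step on.
        out = set(states)
        low = min(states, default=n)
        out.update(range(low + 1, n + 1))
        return out

    current = closure({0})
    for ch in text:
        if ch != "a":
            return False
        # Every live state that is not yet at the end consumes one 'a'.
        current = closure({s + 1 for s in current if s < 2 * n})
    return 2 * n in current
```

The set of live states never holds more than 2n + 1 entries, so the work per input character is bounded by the size of the pattern, with no dependence on how many ways the optional parts could be combined.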
19. Linear-time, minimal, deterministic
FSA construction
Linear algorithm from sorted input
by Daciuk, Mihov, et al.
Active path
states that still can change
States dictionary
nodes that will never change
20. 1) common AP prefix
2) freeze the rest of AP
3) add suffix → new AP
Next input: lucene
21. [Figure: active path l-u-c-e-n-e after adding lucene]
Next input: lucid
22. [Figure: automaton after adding lucid: l-u-c-e-n-e with a branch i-d after c]
23. [Figure: automaton for lucene and lucid: l-u-c-e-n-e with a branch i-d after c]
Next input: lucifer
24. [Figure: automaton after adding lucifer: a branch f-e-r added after i]
25. [Figure: final automaton for lucene, lucid and lucifer after the remaining active path is frozen]
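The three steps (common active-path prefix, freeze the rest, add the suffix) can be sketched as follows. This is a minimal illustration of incremental construction from sorted input, not Lucene's implementation: `State`, `build_minimal_fsa` and the tuple-based `signature` used as the dictionary key are choices made for this example, and it assumes non-empty, strictly increasing input.

```python
class State:
    def __init__(self):
        self.arcs = {}       # label -> State, inserted in sorted order
        self.final = False

    def signature(self):
        # Only meaningful once every target is already a frozen state.
        return (self.final,
                tuple((label, id(target)) for label, target in self.arcs.items()))


def build_minimal_fsa(sorted_words):
    register = {}            # states dictionary: signature -> frozen state
    root = State()
    active = [root]          # active path: states that can still change
    previous = ""

    def freeze(keep):
        # Freeze active-path states deeper than `keep`, deepest first,
        # replacing each one with an equivalent frozen state if one exists.
        while len(active) - 1 > keep:
            state = active.pop()
            label = previous[len(active) - 1]
            active[-1].arcs[label] = register.setdefault(state.signature(), state)

    for word in sorted_words:
        if not word or word <= previous:
            raise ValueError("input must be non-empty, sorted and free of duplicates")
        # 1) common prefix of the new word and the active path
        common = 0
        while (common < min(len(word), len(previous))
               and word[common] == previous[common]):
            common += 1
        # 2) freeze the rest of the active path
        freeze(common)
        # 3) add the suffix, which becomes the new end of the active path
        for ch in word[common:]:
            state = State()
            active[-1].arcs[ch] = state
            active.append(state)
        active[-1].final = True
        previous = word
    freeze(0)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


fsa = build_minimal_fsa(["lucene", "lucid", "lucifer"])
```

For the three example words a plain trie needs 12 states; the automaton built here has 10, because the three end states collapse into one. Each word is processed once and each state is frozen once, which is where the linear behaviour comes from.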
29. org.apache.lucene.util.automaton.fst.*
FSA representation
Arc-based, not state-based
Moore vs. Mealy. Compact vs. intuitive
Next-state chaining
requires unusual tricks during construction
[Figure: example automaton with states s1 to s5 and arcs labelled a, b, c, b, d, e]
[Figure: arc layout in state order s1 s2 s5 s4 s3: cFL bL eFL dL a bL; with next-state chaining: cFL bL eFL dL bLN a]
30. FSA representation (continued)
Everything in a byte[]
traversal-ready, memory-efficient
31. FSA representation (continued)
Dual transition storage format
lookup: bsearch or linear scan
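To make "arc-based, everything in a byte[]" concrete, here is a small sketch of one possible layout. The layout is hypothetical and is not Lucene's actual format: each arc takes four bytes (label, flags, 16-bit target offset), the `FINAL` and `LAST` flags are assumed to correspond to the F and L marks in the figure, and acceptance is stored on arcs, so states without outgoing arcs are never written. Only the linear-scan lookup is shown.

```python
import struct

FINAL = 0x01         # this arc ends an accepted sequence
LAST = 0x02          # this is the last arc leaving its source state
NO_TARGET = 0xFFFF   # the arc leads to a state with no outgoing arcs
ARC = struct.Struct(">BBH")   # label byte, flags byte, target offset


def serialize(states):
    """states maps a name to its arcs: [(label, target name or None, is_final), ...]."""
    offsets, size = {}, 0
    for name, arcs in states.items():
        offsets[name] = size
        size += ARC.size * len(arcs)
    data = bytearray(size)
    for name, arcs in states.items():
        for i, (label, target, is_final) in enumerate(arcs):
            flags = (FINAL if is_final else 0) | (LAST if i == len(arcs) - 1 else 0)
            address = NO_TARGET if target is None else offsets[target]
            ARC.pack_into(data, offsets[name] + i * ARC.size, ord(label), flags, address)
    return bytes(data), offsets


def accepts(data, start, word):
    """Traverse the byte array directly, scanning each state's arcs linearly."""
    state = start
    for i, ch in enumerate(word):
        if state == NO_TARGET:
            return False
        pos = state
        while True:
            label, flags, target = ARC.unpack_from(data, pos)
            if label == ord(ch):
                break
            if flags & LAST:
                return False
            pos += ARC.size
        if i == len(word) - 1:
            return bool(flags & FINAL)
        state = target
    return False


# The lucene / lucid / lucifer automaton, one entry per state that has arcs.
STATES = {
    "s0": [("l", "s1", False)],
    "s1": [("u", "s2", False)],
    "s2": [("c", "s3", False)],
    "s3": [("e", "s4", False), ("i", "s6", False)],
    "s4": [("n", "s5", False)],
    "s5": [("e", None, True)],
    "s6": [("d", None, True), ("f", "s7", False)],
    "s7": [("e", "s8", False)],
    "s8": [("r", None, True)],
}
data, offsets = serialize(STATES)
```

The whole automaton fits in 44 bytes here, and a lookup needs no object allocation: it only reads fixed-size records and follows offsets.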
36. Take 1
flour|3
four|4
fourier|3
furious|2
→fou*
[Figure: automaton over the four terms, each path ending in the | separator followed by the term's weight]
Find prefix.
Depth-in traversal for completions.
PQ on score|alpha
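A minimal sketch of this first approach, assuming higher weights are better and that terms never contain the `|` character. Nested dictionaries stand in for the automaton, and `build` and `suggest` are names invented for this example.

```python
import heapq


def build(entries):
    """Nested dicts as a trie; the '|' key holds the weight, as on the slide."""
    root = {}
    for term, score in entries:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["|"] = score
    return root


def suggest(root, prefix, n):
    # Find prefix.
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    # Traverse everything below the prefix, collecting all completions.
    found = []
    stack = [(node, prefix)]
    while stack:
        node, term = stack.pop()
        for label, target in node.items():
            if label == "|":
                found.append((target, term))
            else:
                stack.append((target, term + label))
    # Priority queue on score, then alphabetical order.
    best = heapq.nsmallest(n, found, key=lambda e: (-e[0], e[1]))
    return [term for score, term in best]


index = build([("flour", 3), ("four", 4), ("fourier", 3), ("furious", 2)])
```

The weakness is visible in `suggest`: every completion under the prefix is visited before the top n can be chosen, so short prefixes over a large dictionary are expensive.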
38. Take 2
2furious
3flour
3fourier
4four
→fou*
[Figure: automaton with one root per weight (2, 3, 4), each followed by the terms of that weight]
From score roots, until N collected.
Find prefix.
Depth-in traversal for completions, stop if N collected.
Find/boost exact match.
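The idea of starting from the score roots and stopping early can be sketched like this. A sorted list with binary search stands in for the automaton here, so this shows the traversal order only; single-digit weights, "higher is better", and the function names are all assumptions of the example. Exact-match boosting is left out.

```python
from bisect import bisect_left


def build(entries):
    """Keys are the weight followed by the term, as on the slide: '4four'."""
    return sorted(f"{weight}{term}" for term, weight in entries)


def suggest(keys, weights, prefix, n):
    results = []
    # From score roots, best weight first, until n completions are collected.
    for weight in sorted(weights, reverse=True):
        wanted = f"{weight}{prefix}"
        for key in keys[bisect_left(keys, wanted):]:
            if not key.startswith(wanted):
                break
            results.append(key[len(str(weight)):])
            if len(results) == n:
                return results
    return results


keys = build([("flour", 3), ("four", 4), ("fourier", 3), ("furious", 2)])
```

Because completions arrive in weight order, the traversal can stop as soon as n are found and no priority queue over the whole result set is needed.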
39. Take 3 (infixes)
2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…
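Generating the infix entries is mechanical, as this short sketch shows. The function name is made up, and the infix weight is a plain parameter because the slide does not say how the value 5 is derived.

```python
def infix_entries(term, weight, infix_weight):
    """The term itself plus one entry per proper suffix, pointing back at the term."""
    entries = [f"{weight}{term}"]
    for start in range(1, len(term)):
        entries.append(f"{infix_weight}{term[start:]}|{term}")
    return entries
```

Calling `infix_entries("furious", 2, 5)` reproduces the seven entries listed for furious. The number of entries grows with the total length of all terms, which is why a compact automaton matters for this variant.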
40. [Figure: combined automaton with weight roots 2 to 7 covering the terms and their infix entries]
41. Constant time lookups!
Regardless of the terms dictionary size.
Regardless of prefix length.
42. Constant time lookups!
Exact matches only.
Static snapshot (not incremental).
Discretized weights.
45. Summary and Conclusions
Automata
compact, powerful, efficient data structure
Lucene/Solr benefits
behind the scenes, but spreading: index, queries, suggesters
API in Lucene
…is being shaped right now, still @experimental