Finite State Automata in [logo]
Dawid Weiss

Dawid Weiss

20+ years of coding
10 years assembly only

Academia & Research
PhD in Information Retrieval, PUT

Open source
Carrot2, HPPC, Lucene, …

Industry & Business
Carrot Search s.c.

Talk outline

State machines (automata)
FSAs, DFAs, FSTs and other XXXs.

Use cases in Lucene and Solr
Suggester. FuzzySearch. Index.

No API details
Still @experimental.
(Non)? Deterministic Finite State (Automata|Machines)
HashSet
hash         → slot   → value
0x29384d34            → lucene
0xde3e3354            → lucid
0x00000666            → lucifer

FSA (deterministic)
l → u → c → e → n → e
          → i → d
              → f → e → r

exists(sequence)
floor(prefix)
ceil(prefix)
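To make the contrast with a hash set concrete, here is a minimal Python sketch of the deterministic automaton above as a plain trie. The class and method names are illustrative assumptions, not Lucene's API, and ceil is taken here to mean "the smallest stored sequence that is greater than or equal to the argument".

```python
class Trie:
    """Deterministic, non-minimal automaton over characters (a plain trie)."""

    def __init__(self, words=()):
        self.arcs = {}       # label -> child Trie
        self.final = False   # True if the path from the root spells a stored word
        for word in words:
            self.add(word)

    def add(self, word):
        node = self
        for ch in word:
            node = node.arcs.setdefault(ch, Trie())
        node.final = True

    def exists(self, word):
        node = self
        for ch in word:
            node = node.arcs.get(ch)
            if node is None:
                return False
        return node.final

    def _smallest(self, prefix):
        # Smallest stored word in this subtree; `prefix` spells the path to self.
        node = self
        while not node.final:
            if not node.arcs:
                return None  # only possible for an empty trie
            ch = min(node.arcs)
            prefix += ch
            node = node.arcs[ch]
        return prefix

    def ceil(self, key, prefix=""):
        """Smallest stored word >= key, or None if there is none."""
        if not key:
            return self._smallest(prefix)
        head, tail = key[0], key[1:]
        child = self.arcs.get(head)
        if child is not None:
            found = child.ceil(tail, prefix + head)
            if found is not None:
                return found
        # Nothing suitable under `head`: fall over to the next larger label.
        larger = [ch for ch in self.arcs if ch > head]
        if larger:
            ch = min(larger)
            return self.arcs[ch]._smallest(prefix + ch)
        return None
```

Unlike the hash set, the ordered transitions make prefix and range style queries possible: for the three terms above, asking for the ceiling of "luci" walks the shared path l, u, c, i and then follows the smallest labels to reach lucid.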
Deterministic, non-minimal
k → i → l → l
b → i → l → l

Deterministic, minimal
k, b → i → l → l
(the k and b arcs lead to the same state)

Non-deterministic, non-minimal
[Diagram: k and b arcs, two parallel i → l branches, final l]
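The difference between the non-minimal and the minimal deterministic automaton can be checked by counting states. The sketch below is an assumption-laden illustration (nested Python lists as states, hypothetical function names): it builds a trie for kill and bill, then merges states bottom-up whenever they have the same finality and the same outgoing arcs.

```python
def build_trie(words):
    """Return a trie whose states are [final, {label: child}] lists."""
    root = [False, {}]
    for word in words:
        node = root
        for ch in word:
            node = node[1].setdefault(ch, [False, {}])
        node[0] = True
    return root


def count_states(node, seen=None):
    """Count distinct states reachable from node (by object identity)."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_states(child, seen) for child in node[1].values())


def minimize(node, register=None):
    """Merge equivalent states bottom-up; returns the canonical state."""
    register = {} if register is None else register
    for label, child in list(node[1].items()):
        node[1][label] = minimize(child, register)
    # Children are canonical by now, so identity of targets is a valid signature.
    signature = (node[0], tuple(sorted((c, id(t)) for c, t in node[1].items())))
    return register.setdefault(signature, node)
```

For kill and bill the trie has nine states and the minimized automaton has five: only the first arc differs, so everything after it is shared.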
(Sorted)Map

lucene    → 1
lucid     → 2
lucifer   → 666

FST (transducer), outputs on the final arcs
l → u → c → e → n → e|1
          → i → d|2
              → f → e → r|666

FST (transducer), outputs moved toward the root
l|1 → u → c → e → n → e
            → i|1 → d
                  → f|664 → e → r
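In the second transducer the value of a key is the sum of the outputs along its path: lucifer = 1 + 1 + 664 = 666. A minimal sketch of how such shared outputs can be maintained while inserting keys, assuming non-negative integer values and unique keys; the Node class and the add and lookup helpers are hypothetical names, and the trie-shaped structure is a simplification (no state sharing), not Lucene's FST builder.

```python
class Node:
    def __init__(self):
        self.arcs = {}          # label -> [output, Node]
        self.final = False
        self.final_output = 0   # output left over when a key ends at this node


def add(root, word, value):
    """Insert word -> value, keeping the shared part of the output near the root."""
    node, remaining = root, value
    for ch in word:
        arc = node.arcs.get(ch)
        if arc is None:
            arc = node.arcs[ch] = [remaining, Node()]
            remaining = 0
        else:
            common = min(arc[0], remaining)
            excess = arc[0] - common
            if excess:
                # Push the part not shared with the new value one level down.
                child = arc[1]
                for sub in child.arcs.values():
                    sub[0] += excess
                if child.final:
                    child.final_output += excess
                arc[0] = common
            remaining -= common
        node = arc[1]
    node.final = True
    node.final_output = remaining


def lookup(root, word):
    """Sum the outputs along the path; None if the word is not accepted."""
    node, total = root, 0
    for ch in word:
        arc = node.arcs.get(ch)
        if arc is None:
            return None
        total += arc[0]
        node = arc[1]
    return (total + node.final_output) if node.final else None
```

Adding lucene, lucid and lucifer in that order yields the outputs l|1, i|1 and f|664 shown on the slide.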
NFSAs and
Regular expressions

Determinization
states explosion, not always possible

Backtracking
recursion explosion

[Diagrams: automaton fragments labeled e1e2, e+, e*, e?]
a?^n a^n
n=3 → a?a?a?aaa

Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).

[Figure: Time [ms] (0 to 35000) against n (0 to 30).]
Time of matching a^n against the pattern a?^n a^n, depending on n. Java 1.6, modern hardware.
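The blow-up can be reproduced with a toy matcher. The sketch below is an illustration under stated assumptions: patterns are restricted to a list of (character, optional) items, and both functions are hypothetical helpers rather than any library's regex engine. The first matcher backtracks, the second tracks the set of all reachable pattern positions, in the spirit of the automaton simulation described in the cited article.

```python
def backtrack(items, text, i=0, j=0, counter=None):
    """Naive backtracking matcher; counter[0] counts recursive calls."""
    if counter is not None:
        counter[0] += 1
    if i == len(items):
        return j == len(text)
    ch, optional = items[i]
    # Greedy: try to consume a character first, then try skipping the item.
    if j < len(text) and text[j] == ch and backtrack(items, text, i + 1, j + 1, counter):
        return True
    return optional and backtrack(items, text, i + 1, j, counter)


def simulate(items, text):
    """Advance the set of all reachable pattern positions, one input character at a time."""
    def closure(positions):
        # Optional items may be skipped without consuming input.
        out = set()
        for p in positions:
            out.add(p)
            while p < len(items) and items[p][1]:
                p += 1
                out.add(p)
        return out

    current = closure({0})
    for ch in text:
        current = closure({p + 1 for p in current
                           if p < len(items) and items[p][0] == ch})
    return len(items) in current


def pattern(n):
    """The pattern a?^n a^n as a list of items."""
    return [("a", True)] * n + [("a", False)] * n
```

On the input a^n the backtracking matcher only succeeds after trying every combination of the optional items, so its call count grows exponentially with n, while the state-set simulation does work proportional to the input length times the pattern length.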
Linear-time, minimal, deterministic
FSA construction

Linear algorithm from sorted input
by Daciuk, Mihov, et al.

Active path
states that still can change

States dictionary
nodes that will never change
1) common AP prefix
2) freeze the rest of AP
3) add suffix → new AP

Adding lucene:
l → u → c → e → n → e

Adding lucid:
l → u → c → e → n → e
          → i → d

Adding lucifer:
l → u → c → e → n → e
          → i → d
              → f → e → r
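The three steps can be sketched in Python as follows. This is a simplified illustration of incremental construction from sorted input, not the code used in Lucene; State, build, replace_or_register and the helper functions are names chosen for this example. The states dictionary is the register, and the active path is the chain of most recently added arcs.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}  # label -> State

    def signature(self):
        # Valid once all targets are frozen (canonical) states.
        return (self.final, tuple(sorted((c, id(t)) for c, t in self.arcs.items())))


def replace_or_register(state, register):
    """Freeze the chain hanging off the most recently added arc of `state`."""
    if not state.arcs:
        return
    label = max(state.arcs)  # sorted input: the newest arc has the largest label
    child = state.arcs[label]
    replace_or_register(child, register)
    state.arcs[label] = register.setdefault(child.signature(), child)


def build(sorted_words):
    register, root, previous = {}, State(), None
    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and unique")
        # 1) follow the common prefix along the active path
        node, i = root, 0
        while i < len(word) and word[i] in node.arcs:
            node = node.arcs[word[i]]
            i += 1
        # 2) freeze the rest of the active path
        replace_or_register(node, register)
        # 3) add the suffix, which becomes the new active path
        for ch in word[i:]:
            node.arcs[ch] = State()
            node = node.arcs[ch]
        node.final = True
        previous = word
    replace_or_register(root, register)
    return root


def accepts(root, word):
    node = root
    for ch in word:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)
```

For lucene, lucid, lucifer the result has 10 states instead of the 12 of a plain trie, because the three final arcs end in one shared state.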
FS(A|T)s in (Lucene|Solr)
Automata in
Lucene|Solr

org.apache.lucene.util.automaton.*
partial port of brics, FuzzyQuery, AutomatonTermsEnum

org.apache.lucene.util.automaton.fst.FST
FSA and FSTs from sorted data, suggester, indexes
org.apache.lucene.util.automaton.fst.*
FSA representation

Arc-based, not state-based
Moore vs. Mealy. Compact vs. intuitive

Input: abc, bd, bde.
[Diagrams: the automaton for this input; states s3 →a→ s2 →b→ s1 →c, and s3 →b→ s4 →d→ s5 →e]

Next-state chaining
requires unusual tricks during construction

Everything in a byte[]
traversals-ready, memory-efficient

Dual transition storage format
lookup: bsearch or linear scan

s1   s2   s5   s4   s3
cFL  bL   eFL  dL   a bL

s1   s2   s5   s4   s3
cFL  bL   eFL  dL   bLN a
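A flat, arc-based layout of this kind can be sketched as follows. Everything here is an assumption made for illustration: the three-byte arc record, the flag values, the NO_TARGET sentinel and the reading of F as "final" and L as "last arc of its state" are choices of this example, not Lucene's actual on-disk format, and next-state chaining (N) is left out. The d arc is marked final in this sketch because bd is one of the inputs.

```python
FINAL, LAST = 0x01, 0x02   # arc flags (assumed meaning of F and L)
NO_TARGET = 0xFF           # marks an arc that leads to a state without arcs
ARC_SIZE = 3               # label byte, flags byte, target offset byte


def arc(label, flags, target=NO_TARGET):
    return bytes([ord(label), flags, target])


# States are written leaf-first, so every target offset is known when needed.
DATA = b"".join([
    arc("c", FINAL | LAST),       # offset 0:  s1
    arc("b", LAST, 0),            # offset 3:  s2, points at s1
    arc("e", FINAL | LAST),       # offset 6:  s5
    arc("d", FINAL | LAST, 6),    # offset 9:  s4, points at s5
    arc("a", 0, 3),               # offset 12: s3 (root), first arc, points at s2
    arc("b", LAST, 9),            # offset 15: s3 (root), last arc, points at s4
])
ROOT = 12


def accepts(data, root, word):
    offset, flags = root, 0
    for ch in word:
        if offset == NO_TARGET:
            return False
        # Linear scan over the arcs of the current state.
        while data[offset] != ord(ch):
            if data[offset + 1] & LAST:
                return False
            offset += ARC_SIZE
        flags = data[offset + 1]
        offset = data[offset + 2]
    return bool(flags & FINAL)
```

Because finality lives on the arcs, no per-state record is needed, and a traversal only ever reads forward through one byte array.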
Input               Input size (MB)   Terms        Compressed size (MB)
                                                   Lucene   morf.   gzip
Wikipedia t.index   481               38 092 045   258      164     149
Polish infl.        162               3 672 200    3.1      1.7     15.4

Use Cases:
Solr's Autocomplete
Solr's
Suggesters

Design choices
sort order (alpha, score), prefix vs. spelling, boost exact matches?

Weights
term→weight, lookup(term, onlyMorePopular)

org.apache.solr.spelling.suggest.Lookup
JaspellLookup, TSTLookup, FSTLookup
Take 1

flour|3
four|4
fourier|3
furious|2

→fou*

[Diagram: automaton over the terms, each followed by a | separator and its weight]

Find prefix.
Depth-first traversal for completions.
PQ on score|alpha
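The Take 1 strategy can be sketched like this, with a plain dictionary-based trie standing in for the automaton. The build and suggest functions and the node layout are assumptions for illustration, not the Solr Lookup API.

```python
import heapq


def build(weighted_terms):
    """Trie node: {"arcs": {label: node}, "weight": int or None}."""
    root = {"arcs": {}, "weight": None}
    for term, weight in weighted_terms.items():
        node = root
        for ch in term:
            node = node["arcs"].setdefault(ch, {"arcs": {}, "weight": None})
        node["weight"] = weight
    return root


def suggest(root, prefix, n):
    # 1) find the state reached by the prefix
    node = root
    for ch in prefix:
        node = node["arcs"].get(ch)
        if node is None:
            return []
    # 2) depth-first traversal collecting every completion
    found, stack = [], [(prefix, node)]
    while stack:
        term, node = stack.pop()
        if node["weight"] is not None:
            found.append((node["weight"], term))
        for ch, child in node["arcs"].items():
            stack.append((term + ch, child))
    # 3) priority queue: highest weight first, ties broken alphabetically
    best = heapq.nsmallest(n, found, key=lambda wt: (-wt[0], wt[1]))
    return [term for _, term in best]
```

The weakness of this approach is visible in step 2: every completion under the prefix is visited before ranking, so short prefixes over a large dictionary are expensive.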
Take 2

2furious
3flour
3fourier
4four

→fou*

[Diagram: automaton with the weight as the first symbol: 2 → furious, 3 → flour, fourier, 4 → four]

From score roots, until N collected.
Find prefix.
Depth-first traversal for completions, stop if N collected.
Find/boost exact match.
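A sketch of the Take 2 idea, again with nested dictionaries standing in for the automaton: the discretized weight becomes the first symbol of every entry, so the traversal can start at the best score root and stop as soon as N suggestions are collected. Function names and the end-of-term marker are assumptions of this example, and the exact-match boost is not implemented.

```python
def build(weighted_terms):
    root = {}
    for term, weight in weighted_terms.items():
        node = root.setdefault(weight, {})   # first arc: the weight bucket
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True                      # end-of-term marker
    return root


def completions(node, term):
    """Yield the terms below node in alphabetical order."""
    if "" in node:
        yield term
    for ch in sorted(k for k in node if k != ""):
        yield from completions(node[ch], term + ch)


def suggest(root, prefix, n):
    out = []
    if n <= 0:
        return out
    for weight in sorted(root, reverse=True):    # from the best score root down
        node = root[weight]
        for ch in prefix:                        # find the prefix in this bucket
            node = node.get(ch)
            if node is None:
                break
        else:
            for term in completions(node, prefix):
                out.append(term)
                if len(out) == n:                # stop once N are collected
                    return out
    return out
```

Compared with Take 1, no ranking step is needed: results come out in score order, and within a bucket in alphabetical order.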
Take 3 (infixes)

2furious
5urious|furious
5rious|furious
5ious|furious
5ous|furious
5us|furious
5s|furious
3flour
…

[Diagram: automaton including the infix entries]
Constant time lookups!
Regardless of the terms dictionary size.
Regardless of prefix length.

Exact matches only.
Static snapshot (not incremental).
Discretized weights.
Top50KWiki.utf8, 676 KB, 50 000 terms

                   Jaspell      TST          FST
RAM [B]            7 869 415    7 914 524    300 175

queries per second, … tpq
PREFIX [100-200]   458          966          742
PREFIX [6-9]       330          228          659
PREFIX [2-4]       126          29           501
Summary and Conclusions

Automata
compact, powerful, efficient data structure

Lucene/Solr benefits
behind the scenes, but spreading: index, queries, suggesters

API in Lucene
…is being shaped right now, still @experimental
Acknowledgement

Michael McCandless

Robert Muir

committer: .+
dawid.weiss@carrotsearch.com

More Related Content

What's hot

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Lucidworks (Archived)
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...confluent
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxData
 
elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리Junyi Song
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafkaconfluent
 
카프카, 산전수전 노하우
카프카, 산전수전 노하우카프카, 산전수전 노하우
카프카, 산전수전 노하우if kakao
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
jemalloc 세미나
jemalloc 세미나jemalloc 세미나
jemalloc 세미나Jang Hoon
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...HostedbyConfluent
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 

What's hot (20)

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
 
elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리elasticsearch_적용 및 활용_정리
elasticsearch_적용 및 활용_정리
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
카프카, 산전수전 노하우
카프카, 산전수전 노하우카프카, 산전수전 노하우
카프카, 산전수전 노하우
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
jemalloc 세미나
jemalloc 세미나jemalloc 세미나
jemalloc 세미나
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 

Viewers also liked

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Lucandra
LucandraLucandra
Lucandraotisg
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 

Viewers also liked (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Automata Invasion
Automata InvasionAutomata Invasion
Automata Invasion
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene
LuceneLucene
Lucene
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Lucandra
LucandraLucandra
Lucandra
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 

More from Lucidworks (Archived)

Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Lucidworks (Archived)
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Dawid Weiss: Finite State Automata in Lucene

  • 1. Finite State Automata in Lucene. Dawid Weiss
  • 2. Dawid Weiss. 20+ years of coding, 10 years assembly only. Academia & Research: PhD in Information Retrieval, PUT. Open source: Carrot2, HPPC, Lucene, … Industry & Business: Carrot Search s.c.
  • 3. Talk outline. State machines (automata): FSAs, DFAs, FSTs and other XXXs. Use cases in Lucene and Solr: Suggester, FuzzySearch, Index. No API details: still @experimental.
  • 4. (Non)? Deterministic Finite State (Automata|Machines)
  • 5. HashSet: hash → slot → value. 0x29384d34 → lucene; 0xde3e3354 → lucid; 0x00000666 → lucifer.
  • 6. HashSet: hash → slot → value. 0x29384d34 → lucene; 0xde3e3354 → lucid; 0x00000666 → lucifer. FSA (deterministic): l-u-c, then e-n-e | i-d | i-f-e-r.
  • 7. HashSet: hash → slot → value. 0x29384d34 → lucene; 0xde3e3354 → lucid; 0x00000666 → lucifer. FSA (deterministic): l-u-c, then e-n-e | i-d | i-f-e-r. Operations: exists(sequence), floor(prefix), ceil(prefix).
  • 8. [Diagram: automaton with paths k-i-l-l and b-i-l-l.] Deterministic, non-minimal.
  • 9. [Diagrams: two automata with paths k-i-l-l and b-i-l-l.] Deterministic, non-minimal. Deterministic, minimal.
  • 10. [Diagrams: three automata with paths k-i-l-l and b-i-l-l.] Deterministic, non-minimal. Deterministic, minimal. Non-deterministic, non-minimal.
  • 11. (Sorted)Map: lucene → 1; lucid → 2; lucifer → 666.
  • 12. (Sorted)Map: lucene → 1; lucid → 2; lucifer → 666. FST (transducer): l-u-c, then e-n-e|1 | i-d|2 | i-f-e-r|666.
  • 13. (Sorted)Map: lucene → 1; lucid → 2; lucifer → 666. FST (transducer): l|1-u-c, then e-n-e | i|1-d | i|1-f|664-e-r.
  • 14. NFSAs and regular expressions. [Diagrams: NFA fragments for a, e1e2, e+, e*, e?.] Determinization: states explosion, not always possible. Backtracking: recursion explosion.
  • 15. a?ⁿaⁿ
  • 17. a?ⁿaⁿ, n=3 → a?a?a?aaa. Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  • 18. [Plot: time in ms (0 to 35000) against n (0 to 30).] Time of matching aⁿ for pattern a?ⁿaⁿ, depending on n. Java 1.6, modern hardware.
  • 19. Linear-time, minimal, deterministic FSA construction. Linear algorithm from sorted input by Daciuk, Mihov, et al. Active path: states that can still change. States dictionary: nodes that will never change.
  • 20. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Input: lucene.
  • 21. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Automaton: l-u-c-e-n-e. Input: lucid.
  • 22. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Automaton: l-u-c, then e-n-e | i-d.
  • 23. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Automaton: l-u-c, then e-n-e | i-d. Input: lucifer.
  • 24. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Automaton: l-u-c, then e-n-e | i-d | i-f-e-r.
  • 25. 1) common AP prefix; 2) freeze the rest of AP; 3) add suffix → new AP. Automaton: l-u-c, then e-n-e | i-d | i-f-e-r, with shared final state.
  • 27. Automata in Lucene|Solr. org.apache.lucene.util.automaton.*: partial port of brics, FuzzyQuery, AutomatonTermsEnum. org.apache.lucene.util.automaton.fst.FST: FSA and FSTs from sorted data, suggester, indexes.
  • 28. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based. Moore vs. Mealy. Compact vs. intuitive. Input: abc, bd, bde. [Diagrams: automata for the input.]
  • 29. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based. Moore vs. Mealy. Compact vs. intuitive. Next-state chaining requires unusual tricks during construction. [Diagram: states s1 to s5 over arcs a, b, c, d, e.] Serialized arcs (states s1 s2 s5 s4 s3): cFL bL eFL dL a bL; with next-state chaining: cFL bL eFL dL bLN a.
  • 30. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based. Moore vs. Mealy. Compact vs. intuitive. Next-state chaining requires unusual tricks during construction. Everything in a byte[]: traversals-ready, memory-efficient. [Diagram: states s1 to s5 over arcs a, b, c, d, e.] Serialized arcs (states s1 s2 s5 s4 s3): cFL bL eFL dL a bL; with next-state chaining: cFL bL eFL dL bLN a.
  • 31. org.apache.lucene.util.automaton.fst.*: FSA representation. Arc-based, not state-based. Moore vs. Mealy. Compact vs. intuitive. Next-state chaining requires unusual tricks during construction. Everything in a byte[]: traversals-ready, memory-efficient. Dual transition storage format: lookup by bsearch or linear scan. [Diagram: states s1 to s5 over arcs a, b, c, d, e.] Serialized arcs (states s1 s2 s5 s4 s3): cFL bL eFL dL a bL; with next-state chaining: cFL bL eFL dL bLN a.
  • 32. Input size and compressed size (MB). Wikipedia t.index: input 481 MB, 38 092 045 terms; Lucene 258, morf. 164, gzip 149. Polish infl.: input 162 MB, 3 672 200 terms; Lucene 3.1, morf. 1.7, gzip 15.4.
  • 34. Solr's Suggesters. Design choices: sort order (alpha, score), prefix vs. spelling, boost exact matches? Weights: term→weight, lookup(term, onlyMorePopular). org.apache.solr.spelling.suggest.Lookup: JaspellLookup, TSTLookup, FSTLookup.
  • 35. Take 1. flour|3, four|4, fourier|3, furious|2.
  • 36. Take 1. flour|3, four|4, fourier|3, furious|2; query →fou*. [Diagram: automaton over the term|weight entries.] Find prefix. Depth-in traversal for completions. PQ on score|alpha.
  • 37. Take 2. 2furious, 3flour, 3fourier, 4four.
  • 38. Take 2. 2furious, 3flour, 3fourier, 4four; query →fou*. [Diagram: automaton over the weight-prefixed entries.] From score roots, until N collected. Find prefix. Depth-in traversal for completions, stop if N collected. Find/boost exact match.
  • 39. Take 3 (infixes). 2furious, 5urious|furious, 5rious|furious, 5ious|furious, 5ous|furious, 5us|furious, 5s|furious, 3flour, …
  • 40. [Diagram: automaton over the weighted infix entries.]
  • 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length.
  • 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of prefix length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  • 43. Top50KWiki.utf8, 676 KB, 50 000 terms (Jaspell / TST / FST). RAM [B]: 7 869 415 / 7 914 524 / 300 175. Queries per second, tpq: PREFIX [100-200]: 458 / 966 / 742; PREFIX [6-9]: 330 / 228 / 659; PREFIX [2-4]: 126 / 29 / 501.
  • 45. Summary and Conclusions. Automata: compact, powerful, efficient data structure. Lucene/Solr: benefits behind the scenes, but spreading: index, queries, suggesters. API in Lucene: …is shaped right now, still @experimental.
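
Slides 11 to 13 attach the map values to arcs, so that reading a term sums the outputs along its path. The following is a minimal Python sketch of such a lookup, using the output placement from slide 13 (l|1, i|1, f|664). The state numbering, the `FST` table and the `lookup` helper are assumptions made for illustration; this is not Lucene's FST API.

```python
# Hand-built transducer for lucene, lucid, lucifer (state numbers are arbitrary).
# Each state maps an input symbol to (next_state, output); outputs are summed.
FST = {
    0: {"l": (1, 1)},
    1: {"u": (2, 0)},
    2: {"c": (3, 0)},
    3: {"e": (4, 0), "i": (5, 1)},
    4: {"n": (6, 0)},
    5: {"d": (8, 0), "f": (7, 664)},
    6: {"e": (8, 0)},
    7: {"e": (9, 0)},
    9: {"r": (8, 0)},
    8: {},                      # shared final state
}
FINAL = {8}


def lookup(term):
    """Return the summed output for `term`, or None if it is not accepted."""
    state, total = 0, 0
    for symbol in term:
        arc = FST[state].get(symbol)
        if arc is None:
            return None         # no transition: term is not in the map
        state, output = arc
        total += output
    return total if state in FINAL else None


print(lookup("lucene"), lookup("lucid"), lookup("lucifer"))
```

Because the shared prefix carries the common part of the values, the suffix states can still be merged, which is what keeps the transducer close to the size of the plain automaton.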
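
Slides 14 to 18 contrast backtracking with automaton-based matching on the pattern a?ⁿaⁿ. As a rough sketch (not the code behind the plot on slide 18), the pattern can be matched by advancing a set of NFA states one input symbol at a time, in the spirit of the Russ Cox article cited on slide 17. The state encoding (state p means "p pattern positions are behind us") and the function names are assumptions.

```python
def closure(states, n):
    """Add every state reachable by skipping optional 'a?' positions (0..n-1)."""
    result = set()
    for p in states:
        result.add(p)
        while p < n and p + 1 not in result:
            p += 1
            result.add(p)
    return result


def matches(n, text):
    """Match `text` against a?^n a^n by tracking the set of live NFA states."""
    accept = 2 * n                       # all 2n pattern positions are behind us
    states = closure({0}, n)
    for ch in text:
        if ch != "a":
            return False
        # Every live state below `accept` consumes one 'a' and moves forward.
        states = closure({p + 1 for p in states if p < accept}, n)
        if not states:
            return False                 # too many symbols: no live state left
    return accept in states


print(matches(3, "aaa"), matches(3, "aa"), matches(30, "a" * 30))
```

The set never holds more than 2n + 1 states, so the work per input symbol stays bounded by the pattern size instead of doubling with every optional position.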
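
The three steps repeated on slides 20 to 25 (common active-path prefix, freeze the rest, add the suffix) can be sketched in a few lines of Python. This is an illustrative reading of the sorted-input construction named on slide 19, not Lucene's builder; the `State` class, `build`, `accepts` and `count_states` are hypothetical names, and the input is assumed to be sorted.

```python
class State:
    def __init__(self):
        self.arcs = {}      # symbol -> State, inserted in sorted order
        self.final = False

    def key(self):
        # Frozen states are equivalent when finality and all (symbol, target) pairs match.
        return (self.final, tuple((s, id(t)) for s, t in self.arcs.items()))


def build(sorted_terms):
    register = {}           # "states dictionary": key -> canonical frozen state
    root = State()
    path = [root]           # "active path": path[i] is reached by previous[:i]
    previous = ""

    def freeze(keep):
        # Step 2: freeze active-path states deeper than `keep`, deepest first.
        while len(path) - 1 > keep:
            child = path.pop()
            symbol = previous[len(path) - 1]
            path[-1].arcs[symbol] = register.setdefault(child.key(), child)

    for term in sorted_terms:
        if term < previous:
            raise ValueError("input must be sorted")
        common = 0          # Step 1: common prefix with the active path
        while common < min(len(term), len(previous)) and term[common] == previous[common]:
            common += 1
        freeze(common)
        for symbol in term[common:]:    # Step 3: the suffix becomes the new active path
            state = State()
            path[-1].arcs[symbol] = state
            path.append(state)
        path[-1].final = True
        previous = term
    freeze(0)
    return root


def accepts(root, term):
    state = root
    for symbol in term:
        state = state.arcs.get(symbol)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


root = build(["lucene", "lucid", "lucifer"])
print(count_states(root), accepts(root, "lucid"), accepts(root, "luci"))
```

Each state is frozen once and looked up once in the register, which is why the construction stays close to linear in the input size when dictionary lookups are treated as constant time.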
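
"Take 2" on slides 37 and 38 puts a discretized weight in front of each term, so that completions can be collected bucket by bucket from the highest score down until N are found. The sketch below imitates that idea with a sorted Python list and binary search standing in for the automaton; that substitution, the single-digit weights and the names `build_index` and `suggest` are all assumptions, and this is not how FSTLookup is implemented.

```python
from bisect import bisect_left


def build_index(weighted_terms):
    """Sort entries of the form '<weight digit><term>', e.g. '4four'."""
    return sorted(f"{weight}{term}" for term, weight in weighted_terms)


def suggest(index, prefix, n, buckets="9876543210"):
    """Collect up to n completions of `prefix`, highest weight bucket first."""
    results = []
    if n <= 0:
        return results
    for bucket in buckets:                       # one "score root" per weight
        start = f"{bucket}{prefix}"
        i = bisect_left(index, start)            # first entry >= bucket + prefix
        while i < len(index) and index[i].startswith(start):
            results.append((index[i][1:], int(bucket)))
            if len(results) == n:
                return results                   # stop as soon as N are collected
            i += 1
    return results


index = build_index([("flour", 3), ("four", 4), ("fourier", 3), ("furious", 2)])
print(suggest(index, "fou", 2))
```

With the data from slide 37, the query fou* visits bucket 4 first and then bucket 3, so the higher-weighted term comes out ahead without any sorting at query time.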