SlideShare a Scribd company logo
1 of 25
Finite-State Queries in Lucene
             Robert Muir
          rmuir@apache.org
Agenda
• Introduction to Lucene
• Improving inexact matching:
  – Background
  – Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
  – Language support: expansion versus
    stemming
  – Improved spellchecking
• Other ongoing developments in Lucene
Introduction to Lucene
• Open Source Search Engine Library
  – Not just Java, ported to other languages too.
  – Commercial support via several companies
• Just the library
  – Embed for your own uses, e.g. Eclipse
  – For a search server, see Solr
  – For web search + crawler, see Nutch
• Website: http://lucene.apache.org
Inverted Index
• Like a TreeMap<Terms, Documents>
• Before 2.9, queries only operate on one
  “subtree”.
  – For example, Regex and Fuzzy exhaustively
    evaluate all terms, unless you give them a
    “constant prefix”.
• TermQuery is just a special case, looks at
  one leaf node.
Lucene 2.9: Fast Numeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
  – But typically this is a small number: e.g. 15
• Query APIs improved to support this.

                   4                       5                           6



            42           44            52                  63                    64



      421    423   445   446   448   521       522   632   633   634       641   642   644
Automaton Queries
• Only explore subtrees that can lead to an
  accept state of some finite state machine.
• AutomatonQuery traverses the term
  dictionary and the state machine in parallel
Another way to think of it
• Index as a state machine that recognizes Terms
  and transduces matching Documents.
• AutomatonQuery represents a user’s search
  need as a FSM.
• The intersection of the two emits search results.
Query API improvements
• Automata might need to do many seeks
  around the term dictionary.
  – Depends on what is in term dictionary
  – Depends on state machine structure
• MultiTermQuery API further improved
  – Easier and more efficient to skip around.
  – Explicitly supports seeking.
Regex, Wildcard, Fuzzy
• Without constant prefix, exhaustive
  – Regex: (http|ftp)://foo.com
  – Wildcard: ?oo?ar
  – Fuzzy: foobar~
• Re-implemented as automata queries
  – Just parsers that produce a DFA
  – Improved performance and scalability
  – (http|ftp)://foo.com examines 2 terms.
Additional/Advanced Use Cases
Stemming
• Stemmers work at index and query time
  – walked, walking -> walk
  – Can increase retrieval effectiveness
• Some problems
  – Mistakes: international -> intern
  – Must determine language of documents
  – Multilingual cases can get messy
  – Tuning is difficult: must re-index
  – Unfriendly: wildcards on stemmed terms…
Expansion instead
• Don’t remove data at index time
  – Expand the query instead.
  – Single field now works well for all queries:
    exact match, wildcard, expanded, etc.
• Simplifies search configuration
  – Tuning relevance is easier, no re-indexing.
  – No need to worry about language ID for docs.
  – Multilingual case is much simpler.
Automata expansion
• Natural fit for morphology
• Use set intersection operators
  – Minus to subtract exact match case
  – Union to search multiple languages
• Efficient operation
  – Doesn’t explode for languages with complex
    morphology
Experimental results
• 125k docs English test collection
   • Results are for TD queries
• Inverted the “S-Stemmer”
   • 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
                  No      Porter    S-Stem    Automaton
               Stemming                        S-Stem
      MAP       0.4575    0.5069    0.5029     0.4979
      MRR       0.8070    0.7862    0.7587     0.8220
     # Terms   336,675    280,061   305,710    336,675
TODO:
• Support expansion models, too in Lucene.
• Language-specific resources
  – lucene-hunspell could provide these
• Language-independent tokenization
  – Unicode rules go a long way.
• Scoring that doesn’t need stopwords
  – For now, use CommonGrams!
Spellchecking




• Lucene spellchecker builds a separate
  index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
  enough for small edit distances (e.g. 1,2)
  to just use the index directly.
• Could simplify configurations, especially
  distributed ones.
Ongoing developments in Lucene
Community
• Merging Lucene and Solr development
  – Still two separate released “products”!!!
  – Share mailing list and code repository
  – Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
  – Lucene features exposed to Solr faster
  – Solr features available to Lucene users
Indexing
• Flexible Indexing
  – Customize the format of the index
  – Decreased RAM usage
  – Faster IndexReader open and faster seeking
• Future
  – Serialize custom attributes to the index
  – More RAM savings
  – Improved index compression
  – Faster Near-Real-Time performance
Text Analysis
• Improved Unicode Support
  – Unicode 4/Supplementary support
  – ICU integration (to support Unicode 5.2)
• Improved Language Support
  – More languages
  – Faster indexing performance
  – Easier packaging and ease-of-use
• Improvements to Analysis API
Relevance and Scoring
• Improved relevance benchmarking
  – Open Relevance Project
  – Quality Benchmarking Package
• Future: More flexible scoring
  – How to support BM25, DFR, more vector
    space?
  – Some packages/patches available, but
     • Additional index statistics needed to be simple
Other
• Improvements to Spatial support
  – See Chris Male’s talk here:
    http://vimeo.com/10204365
  – Progress in both Lucene and Solr
• Steps towards a more modular
  architecture
  – Smaller Lucene core
  – Separate modules with more functionality
Questions?
Backup slides
Finite State Queries In Lucene

More Related Content

What's hot

Microservices with Java, Spring Boot and Spring Cloud
Microservices with Java, Spring Boot and Spring CloudMicroservices with Java, Spring Boot and Spring Cloud
Microservices with Java, Spring Boot and Spring CloudEberhard Wolff
 
SPAのルーティングの話
SPAのルーティングの話SPAのルーティングの話
SPAのルーティングの話ushiboy
 
Oak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryOak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryMichael Dürig
 
Introduction to Hibernate Framework
Introduction to Hibernate FrameworkIntroduction to Hibernate Framework
Introduction to Hibernate FrameworkMohit Kanwar
 
Scaling Twitter
Scaling TwitterScaling Twitter
Scaling TwitterBlaine
 
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf株式会社MonotaRO Tech Team
 
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)Cyrille Le Clerc
 
Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019J V
 
Node.jsで使えるファイルDB"NeDB"のススメ
Node.jsで使えるファイルDB"NeDB"のススメNode.jsで使えるファイルDB"NeDB"のススメ
Node.jsで使えるファイルDB"NeDB"のススメIsamu Suzuki
 
Feature Toggle
Feature ToggleFeature Toggle
Feature ToggleBryan Liu
 
Intro to Pentesting Jenkins
Intro to Pentesting JenkinsIntro to Pentesting Jenkins
Intro to Pentesting JenkinsBrian Hysell
 
Dockerの事例紹介
Dockerの事例紹介Dockerの事例紹介
Dockerの事例紹介Hiroki Endo
 
Alfresco REST API of the future ... is closer than you think
Alfresco REST API of the future ... is closer than you thinkAlfresco REST API of the future ... is closer than you think
Alfresco REST API of the future ... is closer than you thinkJ V
 
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」Takuto Wada
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方Yoshiyasu SAEKI
 
Introduction to git flow
Introduction to git flowIntroduction to git flow
Introduction to git flowKnoldus Inc.
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)NAVER D2
 
Firebase Authを Nuxt + Railsの自前サービス に導入してみた
Firebase Authを Nuxt + Railsの自前サービス に導入してみたFirebase Authを Nuxt + Railsの自前サービス に導入してみた
Firebase Authを Nuxt + Railsの自前サービス に導入してみたTomoe Sawai
 

What's hot (20)

Microservices with Java, Spring Boot and Spring Cloud
Microservices with Java, Spring Boot and Spring CloudMicroservices with Java, Spring Boot and Spring Cloud
Microservices with Java, Spring Boot and Spring Cloud
 
SPAのルーティングの話
SPAのルーティングの話SPAのルーティングの話
SPAのルーティングの話
 
Git collaboration
Git collaborationGit collaboration
Git collaboration
 
Oak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryOak, the Architecture of the new Repository
Oak, the Architecture of the new Repository
 
Introduction to Hibernate Framework
Introduction to Hibernate FrameworkIntroduction to Hibernate Framework
Introduction to Hibernate Framework
 
Scaling Twitter
Scaling TwitterScaling Twitter
Scaling Twitter
 
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf
信頼性とアジリティを同時に上げろ!モノタロウのカナリアリリース導入.pdf
 
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)
Open Source Monitoring for Java with JMX and Graphite (GeeCON 2013)
 
Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019Alfresco Transform Service DevCon 2019
Alfresco Transform Service DevCon 2019
 
Node.jsで使えるファイルDB"NeDB"のススメ
Node.jsで使えるファイルDB"NeDB"のススメNode.jsで使えるファイルDB"NeDB"のススメ
Node.jsで使えるファイルDB"NeDB"のススメ
 
Feature Toggle
Feature ToggleFeature Toggle
Feature Toggle
 
Intro to Pentesting Jenkins
Intro to Pentesting JenkinsIntro to Pentesting Jenkins
Intro to Pentesting Jenkins
 
Dockerの事例紹介
Dockerの事例紹介Dockerの事例紹介
Dockerの事例紹介
 
Alfresco REST API of the future ... is closer than you think
Alfresco REST API of the future ... is closer than you thinkAlfresco REST API of the future ... is closer than you think
Alfresco REST API of the future ... is closer than you think
 
Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
Introduction to git flow
Introduction to git flowIntroduction to git flow
Introduction to git flow
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
 
Firebase Authを Nuxt + Railsの自前サービス に導入してみた
Firebase Authを Nuxt + Railsの自前サービス に導入してみたFirebase Authを Nuxt + Railsの自前サービス に導入してみた
Firebase Authを Nuxt + Railsの自前サービス に導入してみた
 

Viewers also liked

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Lucandra
LucandraLucandra
Lucandraotisg
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in luceneLucidworks (Archived)
 
Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)otisg
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 

Viewers also liked (20)

Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Lucene
LuceneLucene
Lucene
 
Lucandra
LucandraLucandra
Lucandra
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in lucene
 
Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)Search at Tumblr (nyc search meetup)
Search at Tumblr (nyc search meetup)
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 

Similar to Finite State Queries In Lucene

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and AbstractionsMetosin Oy
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmaplucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road maplucenerevolution
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL DatabasesEmanuel Calvo
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsMasoud Kalali
 
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS - The Language Data Network
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteLucidworks
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and TricksErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 

Similar to Finite State Queries In Lucene (20)

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Lucene 101
Lucene 101Lucene 101
Lucene 101
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
Real world RESTful service development problems and solutions
Real world RESTful service development problems and solutionsReal world RESTful service development problems and solutions
Real world RESTful service development problems and solutions
 
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
TAUS Moses Industry Roundtable 2014, Changes in Moses, Hieu Hoang, University...
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Solr 4
Solr 4Solr 4
Solr 4
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 

Finite State Queries In Lucene

  • 1. Finite-State Queries in Lucene Robert Muir rmuir@apache.org
  • 2. Agenda • Introduction to Lucene • Improving inexact matching: – Background – Regular Expression, Wildcard, Fuzzy Queries • Additional use cases: – Language support: expansion versus stemming – Improved spellchecking • Other ongoing developments in Lucene
  • 3. Introduction to Lucene • Open Source Search Engine Library – Not just Java, ported to other languages too. – Commercial support via several companies • Just the library – Embed for your own uses, e.g. Eclipse – For a search server, see Solr – For web search + crawler, see Nutch • Website: http://lucene.apache.org
  • 4. Inverted Index • Like a TreeMap<Terms, Documents> • Before 2.9, queries only operate on one “subtree”. – For example, Regex and Fuzzy exhaustively evaluate all terms, unless you give them a “constant prefix”. • TermQuery is just a special case, looks at one leaf node.
  • 5. Lucene 2.9: Fast Numeric Ranges • Indexes at different levels of precision. • Enumerates multiple subtrees. – But typically this is a small number: e.g. 15 • Query APIs improved to support this. 4 5 6 42 44 52 63 64 421 423 445 446 448 521 522 632 633 634 641 642 644
  • 6. Automaton Queries • Only explore subtrees that can lead to an accept state of some finite state machine. • AutomatonQuery traverses the term dictionary and the state machine in parallel
  • 7. Another way to think of it • Index as a state machine that recognizes Terms and transduces matching Documents. • AutomatonQuery represents a user’s search need as a FSM. • The intersection of the two emits search results.
  • 8. Query API improvements • Automata might need to do many seeks around the term dictionary. – Depends on what is in term dictionary – Depends on state machine structure • MultiTermQuery API further improved – Easier and more efficient to skip around. – Explicitly supports seeking.
  • 9. Regex, Wildcard, Fuzzy • Without constant prefix, exhaustive – Regex: (http|ftp)://foo.com – Wildcard: ?oo?ar – Fuzzy: foobar~ • Re-implemented as automata queries – Just parsers that produce a DFA – Improved performance and scalability – (http|ftp)://foo.com examines 2 terms.
  • 11. Stemming • Stemmers work at index and query time – walked, walking -> walk – Can increase retrieval effectiveness • Some problems – Mistakes: international -> intern – Must determine language of documents – Multilingual cases can get messy – Tuning is difficult: must re-index – Unfriendly: wildcards on stemmed terms…
  • 12. Expansion instead • Don’t remove data at index time – Expand the query instead. – Single field now works well for all queries: exact match, wildcard, expanded, etc. • Simplifies search configuration – Tuning relevance is easier, no re-indexing. – No need to worry about language ID for docs. – Multilingual case is much simpler.
  • 13. Automata expansion • Natural fit for morphology • Use set intersection operators – Minus to subtract exact match case – Union to search multiple languages • Efficient operation – Doesn’t explode for languages with complex morphology
  • 14. Experimental results • 125k docs English test collection • Results are for TD queries • Inverted the “S-Stemmer” • 6 declarative rewrite rules to regex • Competitive with traditional stemming. No Porter S-Stem Automaton Stemming S-Stem MAP 0.4575 0.5069 0.5029 0.4979 MRR 0.8070 0.7862 0.7587 0.8220 # Terms 336,675 280,061 305,710 336,675
  • 15. TODO: • Support expansion models, too in Lucene. • Language-specific resources – lucene-hunspell could provide these • Language-independent tokenization – Unicode rules go a long way. • Scoring that doesn’t need stopwords – For now, use CommonGrams!
  • 16. Spellchecking • Lucene spellchecker builds a separate index to find correction candidates • Perhaps our fuzzy enumeration is now fast enough for small edit distances (e.g. 1,2) to just use the index directly. • Could simplify configurations, especially distributed ones.
  • 18. Community • Merging Lucene and Solr development – Still two separate released “products”!!! – Share mailing list and code repository – Solr dev code in sync with Lucene dev code • Benefits to both Lucene and Solr users – Lucene features exposed to Solr faster – Solr features available to Lucene users
  • 19. Indexing • Flexible Indexing – Customize the format of the index – Decreased RAM usage – Faster IndexReader open and faster seeking • Future – Serialize custom attributes to the index – More RAM savings – Improved index compression – Faster Near-Real-Time performance
  • 20. Text Analysis • Improved Unicode Support – Unicode 4/Supplementary support – ICU integration (to support Unicode 5.2) • Improved Language Support – More languages – Faster indexing performance – Easier packaging and ease-of-use • Improvements to Analysis API
  • 21. Relevance and Scoring • Improved relevance benchmarking – Open Relevance Project – Quality Benchmarking Package • Future: More flexible scoring – How to support BM25, DFR, more vector space? – Some packages/patches available, but • Additional index statistics needed to be simple
  • 22. Other • Improvements to Spatial support – See Chris Male’s talk here: http://vimeo.com/10204365 – Progress in both Lucene and Solr • Steps towards a more modular architecture – Smaller Lucene core – Separate modules with more functionality