2. Topics
• Lucene 4 Beta released this week
• Key Features
• Community
• Evaluation
3. Features
• Quick Hit:
– Language Analysis
• Unicode compliant
• 32+ languages
• 100+ TokenStreams
– Ancillary
• Faceting, spelling, MLT (MoreLikeThis), joins, collapsing, highlighting, benchmarking, …
• More to come:
– FSTs
– Indexing and Storage
– Search
4. FS(A|T)
• Keys:
– byte[] – write-once
– Linear-time construction of minimal automata from sorted input (O(n log n) if the input is not sorted, which isn’t our case)
– Compression
– Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA
– http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118–192.
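The "linear time build from sorted input" point can be illustrated with a minimal sketch of incremental construction: because keys arrive in sorted order, everything below the common prefix with the previous key is finished and can be merged with an equivalent, already registered state. This is an illustration under stated assumptions, not Lucene's FST code: it builds a plain acceptor over Python strings (no byte[] encoding, no outputs or weights), and the names `State`, `build_minimal_fsa`, `accepts` and `count_states` are hypothetical.

```python
class State:
    """One automaton state: outgoing arcs plus an accept flag."""

    def __init__(self):
        self.final = False
        self.arcs = {}  # label -> State

    def signature(self):
        # Two states are interchangeable when they agree on the accept flag
        # and on every (label, target) pair. By the time this is called, all
        # targets are canonical (registered) states, so id() is a stable key.
        return (self.final,
                tuple((label, id(self.arcs[label])) for label in sorted(self.arcs)))


def _freeze(path, key, keep, register):
    # Walk the previous key's path bottom-up, stopping at depth `keep`,
    # and point each parent at the canonical copy of its child.
    for i in range(len(key), keep, -1):
        canonical = register.setdefault(path[i].signature(), path[i])
        path[i - 1].arcs[key[i - 1]] = canonical


def build_minimal_fsa(sorted_keys):
    register = {}   # signature -> canonical state
    root = State()
    path = [root]   # path[i] is the state reached after prev[:i]
    prev = ""
    for key in sorted_keys:
        if key < prev:
            raise ValueError("keys must be supplied in sorted order")
        # Length of the prefix shared with the previous key.
        k = 0
        while k < len(prev) and k < len(key) and prev[k] == key[k]:
            k += 1
        # The previous key's suffix can no longer change: minimize it.
        _freeze(path, prev, k, register)
        del path[k + 1:]
        # Append fresh states for the new suffix.
        for label in key[k:]:
            nxt = State()
            path[-1].arcs[label] = nxt
            path.append(nxt)
        path[-1].final = True
        prev = key
    _freeze(path, prev, 0, register)
    return root


def accepts(root, key):
    state = root
    for label in key:
        state = state.arcs.get(label)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)
```

For the keys tap, taps, top, tops this yields 5 states, where a plain trie would need 8, because the shared suffixes "p" and "ps" are stored once.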
5. Indexing and Storage
• Segmented, write-once approach with merging
• Fast: http://bit.ly/l8qE0i
– 23.2 GB Wikipedia in 5 minutes
– 270 GB/hour of plain text
• Near Real Time Indexing/Search
• Codecs
– Abstraction for: Dictionaries, Postings, Field Storage, Term Vectors and more
– Lucene40 is default – uses Block Tree
– For fun: SimpleTextCodec
• Directory
– Abstraction for IO
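The segmented, write-once approach can be sketched in a few lines. This is a toy model, not Lucene's implementation: `Segment`, `TinyIndex`, `buffer_size` and `merge_factor` are hypothetical names, tokenization is a plain whitespace split, and the "merge everything once there are `merge_factor` segments" rule is an assumed stand-in for a real merge policy. It shows that segments are never modified after creation, that buffered documents become searchable only after a flush, and that a merge builds a new segment from old ones.

```python
class Segment:
    """Write-once inverted index: term -> sorted tuple of doc ids."""

    def __init__(self, postings):
        self.postings = {t: tuple(sorted(ids)) for t, ids in postings.items()}

    @classmethod
    def from_docs(cls, docs):
        postings = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        return cls(postings)

    @classmethod
    def merge(cls, segments):
        # Merging reads the old segments and writes a brand-new one;
        # the inputs are left untouched.
        postings = {}
        for seg in segments:
            for term, ids in seg.postings.items():
                postings.setdefault(term, []).extend(ids)
        return cls(postings)


class TinyIndex:
    def __init__(self, buffer_size=2, merge_factor=3):
        self.buffer_size = buffer_size    # docs held in memory before a flush
        self.merge_factor = merge_factor  # segment count that triggers a merge
        self.buffer = {}
        self.segments = []

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Turn buffered docs into a new immutable segment.
        if self.buffer:
            self.segments.append(Segment.from_docs(self.buffer))
            self.buffer = {}
        if len(self.segments) >= self.merge_factor:
            self.segments = [Segment.merge(self.segments)]

    def search(self, term):
        # A search consults every segment and combines the hits.
        hits = []
        for seg in self.segments:
            hits.extend(seg.postings.get(term.lower(), ()))
        return sorted(hits)
```

Calling `flush()` after a small batch is the toy analogue of near-real-time search: new documents become visible without rewriting existing segments.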
6. Search
• Many query types, query parsers, filtering capabilities
• Document-at-a-time (DAAT) evaluation, mostly
• Pluggable Similarity
– Many implementations and room for more
• BM25, DFR, etc.
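Document-at-a-time evaluation and a pluggable similarity can be sketched together. The following is a simplified illustration, not Lucene's scorer API: `daat_or` and `bm25_similarity` are hypothetical names, postings are assumed to be in-memory lists of (doc_id, term_frequency) pairs sorted by doc id, and the BM25 formula is one common variant with assumed defaults k1=1.2 and b=0.75. All postings for one document are consumed and scored before moving to the next document, and the scoring function is passed in so another model could replace it.

```python
import heapq
import math


def bm25_similarity(n_docs, avg_len, k1=1.2, b=0.75):
    """Return a scoring function (tf, df, doc_len) -> float."""
    def score(tf, df, doc_len):
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        return idf * norm
    return score


def daat_or(postings, doc_lens, similarity):
    """Score a disjunction of terms, one document at a time.

    postings: term -> list of (doc_id, tf), sorted by doc_id
    doc_lens: doc_id -> document length in tokens
    """
    # One heap entry per term cursor, ordered by the cursor's current doc id.
    heap = [(plist[0][0], term, 0) for term, plist in postings.items() if plist]
    heapq.heapify(heap)
    results = []
    while heap:
        doc_id = heap[0][0]
        total = 0.0
        # Drain every cursor positioned on this document before advancing.
        while heap and heap[0][0] == doc_id:
            _, term, pos = heapq.heappop(heap)
            plist = postings[term]
            total += similarity(plist[pos][1], len(plist), doc_lens[doc_id])
            if pos + 1 < len(plist):
                heapq.heappush(heap, (plist[pos + 1][0], term, pos + 1))
        results.append((doc_id, total))
    # Highest score first; ties broken by doc id.
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Swapping in another model only requires another function with the same (tf, df, doc_len) signature, which is the point of keeping the similarity pluggable.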
7. Community
• Large, diverse community with many non-traditional search engine usages
– Object stores, record linkage, mobile, …
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• “The Apache Way”
• You never know where the next good idea is coming from
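The idea behind randomized testing can be sketched as follows: generate inputs from a seeded random source, compare the code under test against a simple brute-force oracle, and report the seed on failure so the run can be reproduced. This is a generic illustration, not Lucene's test framework; `intersect_sorted` and `run_randomized_test` are hypothetical names chosen for the example.

```python
import random


def intersect_sorted(a, b):
    """Code under test: two-pointer intersection of sorted, duplicate-free lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def run_randomized_test(seed, iterations=200):
    # A dedicated Random instance keeps the run reproducible from the seed.
    rng = random.Random(seed)
    for _ in range(iterations):
        a = sorted(rng.sample(range(50), rng.randint(0, 20)))
        b = sorted(rng.sample(range(50), rng.randint(0, 20)))
        expected = sorted(set(a) & set(b))  # brute-force oracle
        actual = intersect_sorted(a, b)
        if actual != expected:
            raise AssertionError(f"mismatch with seed={seed}: a={a} b={b}")
    return iterations
```

Running with a fresh seed on every build widens coverage over time, while re-running with a reported seed reproduces a failure exactly.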
8. Evaluation
• Performance
– http://people.apache.org/~mikemccand/lucenebench/
• Relevance
– Many people have done private evaluations
– Empirical/Anecdotal: $ queries, random sample
– More needed
http://people.apache.org/~mikemccand/lucenebench/indexing.html