1. © Copyright 2013
Open Source Search FTW!
Grant Ingersoll
Lucene/Solr Committer, Apache Software Foundation
CTO, LucidWorks
@gsingers
2. © 2013 LucidWorks
Preaching to the Converted!
• Embrace fuzziness!
• Search is a system building
block
• If the algorithms fit,
use them!
• Search use leads to search
abuse
• Scoring features are
everywhere
http://cheezburger.com/5243950080
3. © 2013 LucidWorks
Topics
• Quick Intro to Lucene and Solr
• What's new in Lucene and Solr 4.x?
- Lucene/Solr for Info Retrieval
• (Ab)Using Search Engine Tech. for Fun and Profit
5. © 2013 LucidWorks
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional
search engine usages
- Object stores, Record linkage, Social, mobile -> web
• Open Dev. > Open Source
• "The Apache Way"
- Meritocracy – Those who do, decide!
• Always Be Testing
- Randomized system tests are all the rage
- http://vimeo.com/32087114
• Patches Welcome!
8. © 2013 LucidWorks
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit -- faster without fsync, allows quicker update
visibility
• DWPT (DocumentsWriterPerThread)
- Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
9. © 2013 LucidWorks
Up and to the Right
• http://people.apache.org/~mikemccand/lucenebench/indexing.html
10. © 2013 LucidWorks
Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..),
etc
- Pulsing codec: improves performance of primary key searches,
inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
- Decoupled from TF/IDF
- Built in alternatives include BM25 & DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
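To make the pluggable scoring point concrete, here is a minimal sketch of one common textbook variant of BM25, written in plain Python and not against the Lucene API. The function name, the toy documents, and the defaults k1=1.2 and b=0.75 are illustrative assumptions, not details taken from the slides.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one tokenized document against a query (hypothetical helper).

    doc_freq maps a term to the number of documents that contain it.
    k1 and b are the usual BM25 free parameters; the defaults here are
    common choices, assumed for illustration.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)
        # Smoothed inverse document frequency; always positive.
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score


# Toy corpus, invented for this example.
docs = [
    ["open", "source", "search"],
    ["search", "search", "engine", "abuse"],
    ["polar", "bear"],
]
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1
avg_len = sum(len(d) for d in docs) / len(docs)

for doc in docs:
    print(doc, round(bm25_score(["search"], doc, doc_freq, len(docs), avg_len), 3))
```

Swapping the body of the scoring function for a different formula is the kind of change that decoupling scoring from TF/IDF is meant to allow.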
11. © 2013 LucidWorks
FS(A|T)
• Keys:
- byte[] – write-once
- Linear-time build of minimal automata (O(n log n) if the input is not sorted)
- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra
• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0
- "Smaller Representation of Finite State Automata"
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
13. © 2013 LucidWorks
Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring
• Indexing
- New Update Processors, including scripting option
- Near real time
• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
14. © 2013 LucidWorks
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
- "geo":"43.17614,-90.57341"
- "geo":"Circle(4.56,1.23 d=0.0710)"
- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
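A filter query like the ones above has to be URL-encoded before it is sent over HTTP. The sketch below builds such a request URL with the Python standard library only. The host, port, collection name (`collection1`) and the `q=*:*` main query are assumptions for illustration; only the `geo` field and the `Intersects(...)` syntax come from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_geo_filter_url(base_url, field, shape):
    """Return a select URL whose fq restricts results to docs intersecting shape.

    base_url is an assumed Solr select endpoint; shape is passed through
    verbatim, e.g. a 'minX minY maxX maxY' rectangle or a POLYGON(...) string.
    """
    params = {
        "q": "*:*",  # match everything, let the filter do the work
        "fq": f'{field}:"Intersects({shape})"',
    }
    # urlencode takes care of quotes, spaces, colons and parentheses.
    return f"{base_url}?{urlencode(params)}"


url = build_geo_filter_url(
    "http://localhost:8983/solr/collection1/select",  # hypothetical endpoint
    "geo",
    "-74.093 41.042 -69.347 44.558",
)
print(url)
# Decoding the query string gives back the original filter.
print(parse_qs(urlsplit(url).query)["fq"][0])
```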
15. © 2013 LucidWorks
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable
• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing
• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
16. © 2013 LucidWorks
New in 4.4 (just released)
• HDFS backed directory for storing index and
transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI
18. © 2013 LucidWorks
… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value
store
- Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast,
Real-time
- Joins, Column Storage
• Solr or Tika + Lucene can index
popular office formats
• Solr can backup/replicate and
scale as content grows
19. © 2013 LucidWorks
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!
• Recommend content to people who exhibit certain
behaviors (clicks, query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior
• See Ted Dunning's talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
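The "search engine for behavior" idea can be sketched in a few lines of plain Python. Each item gets an indicator "field" listing the items that co-occurred with it in user histories, and a user's recent items then act as the query against that field. This is a toy illustration under stated assumptions: it ranks by raw co-occurrence counts, whereas a production approach would typically keep only statistically interesting co-occurrences, and all names and data here are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_indicators(histories):
    """Map each item to a Counter of items seen alongside it in a history."""
    indicators = defaultdict(Counter)
    for items in histories:
        # Count each ordered pair of distinct items once per history.
        for a, b in permutations(set(items), 2):
            indicators[a][b] += 1
    return indicators


def recommend(indicators, recent_items, top_n=3):
    """Use the user's recent items as a 'query' over the indicator fields."""
    recent = set(recent_items)
    scores = Counter()
    for item, cooccurring in indicators.items():
        if item in recent:
            continue  # do not recommend what the user already has
        # Score = how often the candidate co-occurred with the query items.
        scores[item] = sum(cooccurring[r] for r in recent)
    return [item for item, score in scores.most_common(top_n) if score > 0]


# Invented click histories, one list per user.
histories = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["d"]]
indicators = build_indicators(histories)
print(recommend(indicators, ["a"]))
```

In a search engine, the indicator Counter would become an indexed field on the item document and the scoring loop would be an ordinary query.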
21. © 2013 LucidWorks
… Time travel?
• Leverage Solr's new
spatial capabilities to
index non-spatial data,
such as time ranges
- Useful for Open Hours, Shifts,
etc.
• Query using rectangle
intersections
- q = shift:"Intersects(0 19 23 365)"
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
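One way to read the rectangle query above, offered as an interpretation and not as the definitive encoding: treat each time range as a 2D point (x = start, y = end). A stored range then overlaps a query range [q_start, q_end] exactly when start <= q_end and end >= q_start, which is a point-in-rectangle test. The sketch below is plain Python. The bounds 0 and 365 and the (min_x, min_y, max_x, max_y) ordering are assumptions chosen to mirror the numbers on the slide.

```python
def as_point(start, end):
    """Encode a time range as a 2D point: x = start, y = end."""
    return (start, end)


def overlap_query_rect(q_start, q_end, min_value=0, max_value=365):
    """Rectangle (min_x, min_y, max_x, max_y) matching all overlapping ranges.

    x (the stored start) may be anything up to q_end;
    y (the stored end) may be anything from q_start upward.
    """
    return (min_value, q_start, q_end, max_value)


def rect_contains(rect, point):
    """Point-in-rectangle test, inclusive on all edges."""
    min_x, min_y, max_x, max_y = rect
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


# Hypothetical shifts, expressed in the same units as the query.
shifts = {"early": (5, 10), "late": (17, 20), "long": (10, 40), "after": (24, 30)}
query = overlap_query_rect(19, 23)
print(query)
for name, (start, end) in shifts.items():
    print(name, rect_contains(query, as_point(start, end)))
```

With these inputs the computed rectangle has the same four numbers as the slide's query, which is why this reading seems plausible.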
22. © 2013 LucidWorks
Boldly go forth and rank!
• Faster
• More Flexible
• Easier than ever scaling
• More reliable than ever
• Reduced cost of experimentation
23. © 2013 LucidWorks
Where to Next?
• Lucene/Solr EU Conference:
- Dublin, IE, November 4-7: http://lucenerevolution.org/
- CFP Open Now
• Lucene/Solr
- http://lucene.apache.org
- {java-user|solr-user}@lucene.apache.org
- SIGIR '12 Open Source Workshop
» http://opensearchlab.otago.ac.nz/paper_10.pdf
• LucidWorks
- http://www.lucidworks.com
- Commercial support, products, etc. for
Lucene/Solr
• Me
- grant@lucidworks.com
- @gsingers on Twitter
- "Taming Text" – Engineer's guide to open source search and NLP
» http://www.manning.com/ingersoll
24. © 2013 LucidWorks
Credits
• All of the Lucene/Solr committers and contributors
• Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html
• Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1
• Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-The-American.jpg
• Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
• Love: http://www.msruntheus.com/above-all-love-each-other-deeply/
• TARDIS: http://2.bp.blogspot.com/-ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_doctor_who.jpg
Editor's Notes
• Search Abuse: Can discuss how I started just doing free text, but then a curious thing happened: started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more
• What is Lucene? What is Solr?
• Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
• Oh, BTW, it can do search over the values. Keys can be anything, not just strings