Thinking Lucene   Think Lucid




Bet You Didn’t Know Lucene Can…




Grant Ingersoll
Chief Scientist | Lucid Imagination
@gsingers


                                                    CONFIDENTIAL   |   1
A Funny Thing Happened On the Way To…



“Apache Lucene(TM) is a high-performance, full-featured text search engine
   library written entirely in Java. It is a technology suitable for nearly any
   application that requires full-text search, especially cross-platform.”


                                                - http://lucene.apache.org




What can Lucene solve?



   DB/NoSQL-like problems


   Search-like problems


   Stuff




… Find your Keys?



   Lucene/Solr is a reasonably fast key-value store
     – Bonus: search your values!
   NoSQL before NoSQL was cool


   10 M doc index: 600,000 lookups per second, single threaded, read-only
     – Not hard to remove the read-only assumption or the single node assumption
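
A minimal sketch of the "key-value store with searchable values" idea, in plain Python rather than Lucene. The `TinyIndex` class and its `put`/`get`/`search` methods are hypothetical names used only for illustration: stored values are kept by key, and an inverted index maps each term back to the keys whose values contain it.

```python
from collections import defaultdict


class TinyIndex:
    """Toy stand-in for a search index used as a key-value store."""

    def __init__(self):
        self.docs = {}                     # key -> stored value
        self.postings = defaultdict(set)   # term -> keys whose value contains it

    def put(self, key, value):
        # On overwrite, drop the old postings before adding the new ones.
        if key in self.docs:
            for term in set(self.docs[key].lower().split()):
                self.postings[term].discard(key)
        self.docs[key] = value
        for term in set(value.lower().split()):
            self.postings[term].add(key)

    def get(self, key):
        # Plain key lookup, as in any key-value store.
        return self.docs.get(key)

    def search(self, term):
        # The "bonus": find keys by the content of their values.
        return sorted(self.postings.get(term.lower(), ()))


index = TinyIndex()
index.put("user:1", "Bob repairs flutes")
index.put("user:2", "Mary likes flutes and sunsets")
```

Here `index.get("user:1")` is the key lookup and `index.search("flutes")` is the value search; a real index adds analysis, persistence, and scoring on top of this.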



…Store your Content?



   Solr or Tika + Lucene can index popular office formats
   Solr can backup/replicate and scale as content grows
   Commit/rollback functionality
   Can dynamically add fields
     – No schema required up front
   Retrieval is fast for keys or arbitrary text
   Trunk/4.x:
     – Column storage
     – Pluggable storage capabilities
     – Joins (a few variations)







Search-like Problems




… Find you a Date?


Meet Bob
   Sex: Male
   Seeking: Female
   Age: 53
   Job: Flute Repair shop owner
   Location: Moose Jaw, Saskatchewan
   Likes: rap music, cricket, long walks on the beach, Thai food
   Dislikes: classical music, cats

 Likes:     Rap music   Cricket   Long walks on the beach   Thai food
 Payload:   5           2         10

Along comes Mary


Meet Mary
   Sex: Female
   Seeking: Male
   Age: 47
   Job: CEO
   Location: Moose Jaw, Saskatchewan
   Likes: Hip hop, sunsets, Korean food
   Dislikes: cats

Filters
   Sex, Seeking, Age (as RangeQuery), Job, Location (as spatial)

Queries
   Likes: OR, Phrases, Payload Queries
   Dislikes: As Not Queries or down boosted or perhaps ignore?
   Boosts: Popularity, Secret Sauce
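
The filter/query split can be sketched in plain Python (this is not Lucene API code). Filters are hard constraints that do not affect the score; likes are an OR query whose matches are weighted by a per-term payload. Bob's weights 5, 2 and 10 come from his profile slide; the weight 1 for Thai food, the other three profiles, the function name `find_matches`, and the already-expanded query terms are assumptions made for illustration. Dislikes are treated here as NOT clauses, which is only one of the options the slide lists.

```python
profiles = [
    {"name": "Bob", "sex": "male", "seeking": "female", "age": 53,
     # like -> payload weight (5/2/10 from the slide, 1 is assumed)
     "likes": {"rap music": 5, "cricket": 2,
               "long walks on the beach": 10, "thai food": 1}},
    # Hypothetical extra profiles to exercise the filters.
    {"name": "Al", "sex": "male", "seeking": "female", "age": 50,
     "likes": {"rap music": 2, "cats": 8}},
    {"name": "Carl", "sex": "male", "seeking": "female", "age": 30,
     "likes": {"rap music": 9}},
    {"name": "Dan", "sex": "male", "seeking": "female", "age": 48,
     "likes": {"rap music": 3}},
]


def find_matches(profiles, sex, seeking, age_range, likes, dislikes=()):
    youngest, oldest = age_range
    results = []
    for p in profiles:
        # Filters: exact matches plus an age range, no effect on score.
        if p["sex"] != sex or p["seeking"] != seeking:
            continue
        if not youngest <= p["age"] <= oldest:
            continue
        # Dislikes as NOT clauses: exclude anyone who likes a disliked thing.
        if any(d in p["likes"] for d in dislikes):
            continue
        # Likes as an OR query: sum the payload weight of every matched like.
        score = sum(w for like, w in p["likes"].items() if like in likes)
        if score > 0:
            results.append((p["name"], score))
    return sorted(results, key=lambda r: r[1], reverse=True)


matches = find_matches(
    profiles, sex="male", seeking="female", age_range=(45, 60),
    likes={"rap music", "long walks on the beach"}, dislikes={"cats"},
)
```

With this data, Carl is removed by the age filter, Al by the dislike clause, and Bob outranks Dan because his matched likes carry larger payloads.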

Will Mary and Bob Find Love?




                          ?
CEO                 Owner, Chief Executive Officer, Executive
Sunsets             Beaches, outdoors
Korean Food         Asian Food
Age Range Match     Yes

Match



… Label Your Content?



   Given a new, unseen document, label it with one or more predefined labels


   Supervised Machine Learning


   Train
     – Set of data annotated with predefined labels


   Test
     – Evaluate how well the classifier can label your content


Simple Vector Space Classifiers

   K Nearest Neighbor (kNN)
     – Each Training Document indexed with id, category and
       text field
     – Pick Category based on whichever category has the most
       hits in the top K


   Simple TF-IDF (TFIDF)
     – Training                                                        Chapter 7
         • Index category and concatenation of all content with that
           label
     – Pick Category based on whichever document has the best score


   Query: “Important” terms from new, unseen document
     – Use Lucene’s More Like This to generate the Query
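
A minimal kNN sketch in plain Python, not Lucene's More Like This. Each training document carries a category and text, the new document is the query, and the category with the most hits in the top K wins. The similarity used here (count of shared terms) is an assumed stand-in for a real relevance score, and `knn_classify` is a hypothetical name; the headlines are the ones from this deck's training-data slide.

```python
from collections import Counter

training = [
    ("politics", "obama fundraising"),
    ("politics", "republican fundraising"),
    ("politics", "obama clashes with republicans"),
    ("sports", "vikings win super bowl"),
    ("sports", "carolina hurricanes earn first stanley cup"),
    ("sports", "minnesota twins capture world series"),
    ("entertainment", "spongebob caught shoplifting"),
    ("entertainment", "brangelina on a rampage"),
    ("entertainment", "megastar clashes with paparazzi"),
]


def knn_classify(text, training, k=3):
    query = set(text.lower().split())
    # Score each training document by the number of terms it shares with the query.
    scored = [(len(query & set(doc.split())), category)
              for category, doc in training]
    # Keep the top k documents that matched at least one term.
    top = sorted((s for s in scored if s[0] > 0), reverse=True)[:k]
    # Majority vote over the categories of those neighbours.
    votes = Counter(category for _, category in top)
    return votes.most_common(1)[0][0] if votes else None


label = knn_classify("twins win world series", training)
```

A query that shares no terms with any training document returns `None` rather than a guessed label.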
Training Data



   Politics                         Sports                                        Entertainment
   Obama fundraising                Vikings win Super Bowl                        Spongebob caught shoplifting
   Republican Fundraising           Carolina Hurricanes earn first Stanley Cup    Brangelina on a Rampage
   Obama clashes with Republicans   Minnesota Twins capture World Series          Megastar clashes with Paparazzi




Simple TF-IDF Model

 Training
   Politics: obama fundraising republican fundraising obama clashes with republicans
   Sports: vikings win super bowl carolina hurricanes earn first stanley cup minnesota twins capture world series
   Entertainment: spongebob caught shoplifting brangelina rampage megastar clashes paparazzi


 Test/Production

     Input document is the query!

     e.g.: patriots lose super bowl
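
The simple TF-IDF model can be sketched in a few lines of plain Python. Each category is one "document" built by concatenating its training text, and the input document is scored against every category. The scoring formula (raw term frequency times `log(N/df) + 1`) is an assumption chosen for brevity and is not Lucene's actual similarity; `tfidf_classify` is a hypothetical name.

```python
import math
from collections import Counter

category_docs = {
    "politics": "obama fundraising republican fundraising "
                "obama clashes with republicans",
    "sports": "vikings win super bowl carolina hurricanes earn first "
              "stanley cup minnesota twins capture world series",
    "entertainment": "spongebob caught shoplifting brangelina rampage "
                     "megastar clashes paparazzi",
}


def tfidf_classify(text, category_docs):
    term_counts = {c: Counter(d.split()) for c, d in category_docs.items()}
    n = len(term_counts)
    scores = {}
    for category, counts in term_counts.items():
        score = 0.0
        for term in text.lower().split():
            # Document frequency: how many category documents contain the term.
            df = sum(1 for c in term_counts.values() if term in c)
            if df:
                idf = math.log(n / df) + 1.0
                score += counts[term] * idf
        scores[category] = score
    best = max(scores, key=scores.get)
    # No overlapping terms at all means no label.
    return best if scores[best] > 0 else None


label = tfidf_classify("patriots lose super bowl", category_docs)
```

Only the sports document contains "super" and "bowl", so it is the only category with a non-zero score for the slide's example query.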




Help you Learn a New Language?


   Manu Konchady uses Lucene to teach new languages
   Find exactly where a match occurred


   Can also identify languages! (Solr)
   Analyzers can help you tokenize, stem, etc. for many languages

… Detect Plagiarism?



   For each document
     – For each sentence
         • Index Sentence and calculate a hash for each sentence

   Hash function has property that similar
    sentences will hash to the same value
   For each new document
     – For each sentence
         • Query: hash (optionally also search for the
           sentence)

   Can also do this at the document level by calculating hash for whole document
   Contrib’d by Andrzej Bialecki and Erik Hatcher
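
A rough sketch of the sentence-hash approach in plain Python. The signature used here (lowercase, drop a small stopword list, sort the unique tokens, then MD5) is an assumed, deliberately simple way to make similar sentences collide; it is not the contributed implementation credited above. All function names and the sample text are hypothetical.

```python
import hashlib
import re

STOPWORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "on", "that"}


def sentence_signature(sentence):
    # Normalise so that reordering, case and stopword changes do not alter the hash.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    content = sorted(set(t for t in tokens if t not in STOPWORDS))
    return hashlib.md5(" ".join(content).encode("utf-8")).hexdigest()


def split_sentences(text):
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def build_index(documents):
    # signature -> list of (document id, original sentence)
    index = {}
    for doc_id, text in documents.items():
        for sentence in split_sentences(text):
            index.setdefault(sentence_signature(sentence), []).append(
                (doc_id, sentence))
    return index


def find_suspects(index, new_text):
    # Query the index with the hash of each sentence in the new document.
    hits = []
    for sentence in split_sentences(new_text):
        for doc_id, original in index.get(sentence_signature(sentence), []):
            hits.append((sentence, doc_id, original))
    return hits


corpus = {"doc1": "Lucene is a search library. It is written in Java."}
index = build_index(corpus)
hits = find_suspects(index, "A search library is Lucene! Solr builds on it.")
```

The reordered first sentence hashes to the same value as the original and is reported; the second sentence does not match anything in the corpus.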



… Find the Bad Guys?



   Problem: Is Bob “Bad Guy” Johnson the same person as Robert William
    Johnson?
   Called Record Linkage or Entity Resolution
     – Common problem in business, finance, marketing, etc.
   Index contains all user profiles
   Ad hoc
     – Query: incoming user profile
     – Tricks: fuzzy queries, alternate queries
     – Post process results
   Systematic: pairwise similarity (More Like This for all docs)
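
The ad hoc tricks can be illustrated with a small plain-Python sketch. It uses `difflib.SequenceMatcher` as a stand-in for a fuzzy query and a tiny nickname table as the "alternate queries"; the table contents, the 0.8 threshold, and the function names are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alternates table: nickname -> canonical given name.
NICKNAMES = {"bob": "robert", "bill": "william"}


def name_tokens(name):
    # Drop quoted aliases such as "Bad Guy", then lowercase and tokenize.
    name = re.sub(r'["“”].*?["“”]', " ", name)
    tokens = re.findall(r"[a-z]+", name.lower())
    # Alternate query: replace known nicknames with their canonical form.
    return [NICKNAMES.get(t, t) for t in tokens]


def fuzzy_equal(a, b, threshold=0.8):
    # Fuzzy match: tolerate small spelling differences between tokens.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def match_score(query_name, candidate_name):
    q, c = name_tokens(query_name), name_tokens(candidate_name)
    if not q or not c:
        return 0.0
    # Fraction of query tokens that fuzzily match some candidate token.
    matched = sum(1 for t in q if any(fuzzy_equal(t, u) for u in c))
    return matched / len(q)


score = match_score('Bob "Bad Guy" Johnson', "Robert William Johnson")
```

A post-processing step would then apply a cutoff to `score` and compare other profile fields before declaring two records the same person.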



…Make you more money?



   Who says a search needs to just do keyword matching using good old TF-IDF?


   Solr makes it easy to:
     – Rerank documents based on things like price, inventory, margin, popularity, etc.
     – Apply Business Rules
     – Hardcode results
     – Scale for the Holiday season
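
As an illustration of reranking and hardcoded results, here is a plain-Python sketch (not Solr configuration). The field names, the weights, the rule that out-of-stock items sink to the bottom, and the sample products are all hypothetical.

```python
def rerank(hits, pinned_ids=(), margin_weight=0.5, popularity_weight=0.3):
    def business_score(hit):
        # Business rule (assumed): out-of-stock items get no score.
        if hit["inventory"] <= 0:
            return 0.0
        # Boost the text relevance score by margin and popularity.
        boost = (1.0 + margin_weight * hit["margin"]
                 + popularity_weight * hit["popularity"])
        return hit["text_score"] * boost

    # Hardcoded results come first, in the order they were pinned.
    pinned = [h for pid in pinned_ids for h in hits if h["id"] == pid]
    rest = sorted((h for h in hits if h["id"] not in pinned_ids),
                  key=business_score, reverse=True)
    return [h["id"] for h in pinned + rest]


hits = [
    {"id": "A", "text_score": 2.0, "margin": 0.1, "popularity": 0.2, "inventory": 5},
    {"id": "B", "text_score": 1.8, "margin": 0.6, "popularity": 0.5, "inventory": 3},
    {"id": "C", "text_score": 2.5, "margin": 0.4, "popularity": 0.9, "inventory": 0},
    {"id": "D", "text_score": 0.5, "margin": 0.2, "popularity": 0.1, "inventory": 10},
]
order = rerank(hits)
pinned_order = rerank(hits, pinned_ids=["D"])
```

Product C has the best text score but no inventory, so it drops to last place, while B overtakes A on margin and popularity.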




… Play Jeopardy!?



   Indeed, IBM Watson uses Lucene
   Critical component of Question Answering (QA) is often retrieval
   How to build a simple QA system?
     – Documents can be:
         • Whole text, paragraph, sentences
         • Position-based queries (spans) to find where keywords match
         • Index part of speech tags and possibly other analysis
     – Queries:
         • Classify based on Answer Type
         • Retrieve passages based on keywords plus answer type          Chapter 9
         • Score passages!








Stuff




… Make you a Better Programmer?

   If your tests aren’t failing from time to time, are you really doing enough
    testing?
   We’ve introduced some serious randomized testing
     – We run randomized tests every 30 minutes, ad infinitum
     – Random Locales, time zones, index file format, much, much more
     – Some in the community also randomize JVMs continuously


   We liked what we built so much, we now publish it as its own module
     – https://issues.apache.org/jira/browse/LUCENE-3492
     – https://github.com/carrotsearch/randomizedtesting


   More References at end of talk
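
The core idea of randomized testing can be shown in a short Python sketch (the actual Lucene framework is Java; this is only an illustration). Inputs are generated from a random seed, an invariant is checked on each input, and the seed is included in any failure message so that the run can be reproduced. The function under test, the alphabet, and the invariant are assumptions chosen to keep the example small.

```python
import random


def tokenize(text):
    # Function under test: a trivial whitespace tokenizer.
    return text.split()


def run_randomized_test(seed=None, iterations=200):
    # Pick a fresh seed unless one is supplied to reproduce a failure.
    seed = random.randrange(2**32) if seed is None else seed
    rng = random.Random(seed)
    alphabet = "ab \t\n"
    for i in range(iterations):
        length = rng.randint(0, 20)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        tokens = tokenize(text)
        # Invariant: tokenizing drops whitespace and nothing else.
        expected = "".join(c for c in text if not c.isspace())
        if "".join(tokens) != expected:
            raise AssertionError(
                f"failed on {text!r} (iteration {i}); reproduce with seed={seed}")
    return seed


used_seed = run_randomized_test()
```

Calling `run_randomized_test(seed=used_seed)` replays exactly the same inputs, which is what makes an occasional random failure debuggable.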


… Run Circles Around Previous Versions of Lucene?



   Finite State Transducers


   Pluggable Indexing Models
     – Codecs

                                         http://bit.ly/dawid-weiss-lucene-rev
   Pluggable Scoring Models
     – BM25, Information based, others








Crazy Stuff




…Play Chess?!? – THOUGHT EXPERIMENT

   Well, maybe not play, but, could we help?
   Premise: Even though chess has a very large number of possibilities, most
    board positions have been played before
   Could you assist with real time analysis?
     – Index large collection of previously played games
   Document A
     – Sequence of all moves of the game
     – Metadata
     – Query: PrefixQuery of current board + Function
     – Results: Ranked list of moves most likely to lead to a win
   Alternatives: index board positions, subsequences of moves (n-grams)
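
The thought experiment can be sketched in plain Python. Each "document" is a game's move sequence plus its result, the query is the sequence of moves played so far, and candidate next moves are ranked by how often they led to a win. The games, the function name `suggest_moves`, and the win-rate ranking are hypothetical; the sketch compares move lists token by token instead of issuing a Lucene PrefixQuery.

```python
from collections import defaultdict

# Hypothetical game collection: (moves, winner).
games = [
    ("e4 e5 Nf3 Nc6 Bb5", "white"),
    ("e4 e5 Nf3 Nc6 Bc4", "black"),
    ("e4 e5 Nf3 Nc6 Bb5", "white"),
    ("e4 c5 Nf3 d6", "black"),
    ("d4 d5 c4", "white"),
]


def suggest_moves(games, moves_so_far, side):
    prefix = moves_so_far.split()
    stats = defaultdict(lambda: [0, 0])   # next move -> [wins, games]
    for moves, winner in games:
        seq = moves.split()
        # Prefix match: the game must start with the moves played so far.
        if seq[:len(prefix)] == prefix and len(seq) > len(prefix):
            next_move = seq[len(prefix)]
            stats[next_move][1] += 1
            if winner == side:
                stats[next_move][0] += 1
    # Rank by win rate, then by number of games, then alphabetically.
    ranked = sorted(stats.items(),
                    key=lambda kv: (-kv[1][0] / kv[1][1], -kv[1][1], kv[0]))
    return [(move, wins / total) for move, (wins, total) in ranked]


suggestions = suggest_moves(games, "e4 e5 Nf3 Nc6", "white")
```

A move sequence is not the same thing as a board position, since different move orders can reach the same position, which is one reason the slide lists indexing board positions as an alternative.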



What else?



   In case you haven’t noticed, Lucene can do a lot of things that are not
    “traditional search”




   I’d love to hear your use cases!




Resources


    http://lucene.apache.org


    @gsingers / grant@lucidimagination.com


    http://www.lucidimagination.com


    http://lucene.grantingersoll.com




References and Credits



   Unit Testing:
     – http://wiki.apache.org/lucene-java/RunningTests
     – Robert Muir:
       http://lucenerevolution.org/sites/default/files/test%20framework.pdf
     – Dawid Weiss’ Lucene Eurocon talk: http://bit.ly/vaxdUC


   Images:
     – Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
     – Storage:
       http://www.flickr.com/photos/d_e_/7641738/sizes/m/in/photostream/




Bet you didn't know Lucene can...

  • 1. Thinking Lucene Think Lucid Bet You Didn’t Know Lucene Can… Grant Ingersoll Chief Scientist | Lucid Imagination @gsingers CONFIDENTIAL | 1
  • 2. A Funny Thing Happened On the Way To… “Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.” - http://lucene.apache.org CONFIDENTIAL | 2
  • 3. What can Lucene solve?  DB/NoSQL-like problems  Search-like problems  Stuff CONFIDENTIAL | 3
  • 4. … Find your Keys?  Lucene/Solr is a reasonably fast key-value store – Bonus: search your values!  NoSQL before NoSQL was cool  10 M doc index: 600,000 lookups per second, single threaded, read- only – Not hard to remove the read-only assumption or the single node assumption CONFIDENTIAL | 4
  • 5. …Store your Content?  Solr or Tika + Lucene can index popular office formats  Solr can backup/replicate and scale as content grows  Commit/rollback functionality  Can dynamically add fields – No schema required up front  Retrieval is fast for keys or arbitrary text  Trunk/4.x: – Column storage – Pluggable storage capabilities – Joins (a few variations) CONFIDENTIAL | 5
  • 6. Thinking Lucene Think Lucid Search-like Problems CONFIDENTIAL | 6
  • 7. … Find you a Date? Sex: Male Seeking: Female Meet Age: 53 Bob Job: Flute Repair shop owner Location: Moose Jaw, Saskatchewan Likes: rap music, cricket, long walks on the beach, Thai food Dislikes: classical music, cats Likes: Rap music Cricket Long walks Thai food on the beach Likes: Rap music Cricket Long walks Thai food on the beach Payload 5 2 10 CONFIDENTIAL | 7
  • 8. Along comes Mary Sex: Female Seeking: Male Age: 47 Meet Mary Job: CEO Location: Moose Jaw, Saskatchewan Likes: Hip hop, sunsets, Korean food Dislikes: cats Filters Queries Sex, Seeking, Age (as Likes: OR, Phrases, Payload RangeQuery), Job, Location (as Queries spatial) Dislikes: As Not Queries or down boosted or perhaps ignore? Boosts: Popularity, Secret Sauce CONFIDENTIAL | 8
  • 9. Will Mary and Bob Find Love? ? CEO Owner, Chief Executive Officer, Executive Sunsets Beaches, outdoors Match Korean Food Asian Food Age Range Match Yes CONFIDENTIAL | 9
  • 10. … Label Your Content?  Given a new, unseen document, label it with one one or more predefined labels  Supervised Machine Learning  Train – Set of data annotated with predefined labels  Test – Evaluate how well classifier can determine your content CONFIDENTIAL | 10
  • 11. Simple Vector Space Classifiers  K Nearest Neighbor (kNN) – Each Training Document indexed with id, category and text field – Pick Category based on whichever category has the most hits in the top K  Simple TF-IDF (TFIDF) – Training Chapter 7 • Index category and concatenation of all content with that label – Pick Category based on which ever document has best score  Query: “Important” terms from new, unseen document – Use Lucene’s More Like This to generate the Query CONFIDENTIAL | 11
  • 12. Training Data Politics Sports Entertainment Spongebob Obama Vikings win caught fundraising Super Bowl shoplifting Carolina Republican Hurricanes Brangelina on a Fundraising earn first Rampage Stanley Cup Obama clashes Minnesota Megastar with Twins capture clashes with Republicans World Series Paparazzi CONFIDENTIAL | 12
  • 13. Simple TF-IDF Model Training Politics Sports Entertainment obama fundraising vikings win super bowl spongebob caught republican fundraising carolina hurricanes earn shoplifting brangelina obama clashes with first stanley cup rampage megastar republicans minnesota twins capture clashes paparazzi world series Test/Production Input document is the query! e.g.: patriots lose super bowl CONFIDENTIAL | 13
  • 14. Help you Learn a New Language?  Manu Konchady uses Lucene to teach new languages  Find exactly where a match occurred  Can also identify languages! (Solr)  Analyzers can help you tokenize, stem, etc. many languages CONFIDENTIAL | 14
  • 15. … Detect Plagiarism?  For each document – For each sentence • Index Sentence and calculate a hash for each document  Hash function has property that similar sentences will hash to the same value  For each new document – For each sentence • Query: hash (optionally also search for the sentence)  Can also do this at the document level by Contrib’d by Andrzej Bialecki calculating hash for whole document and Erik Hatcher CONFIDENTIAL | 15
  • 16. … Find the Bad Guys?  Problem: Is Bob “Bad Guy” Johnson the same person as Robert William Johnson?  Called Record Linkage or Entity Resolution – Common problem in business, finance, marketing, etc.  Index contains all user profiles  Ad hoc – Query: incoming user profile – Tricks: fuzzy queries, alternate queries – Post process results  Systematic: pairwise similarity (More Like This for all docs) CONFIDENTIAL | 16
  • 17. … Make you more money?
      Who says a search needs to just do keyword matching using good old TF-IDF?
      Solr makes it easy to:
        – Rerank documents based on things like price, inventory, margin, popularity, etc.
        – Apply Business Rules
        – Hardcode results
        – Scale for the Holiday season
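To make the reranking idea concrete, here is a small sketch that combines a text relevance score with business signals and pins "hardcoded" results to the top. The field names, the weights, and the rule that out-of-stock items score zero are assumptions for illustration; in Solr this kind of logic would typically be expressed through boosts and function queries rather than application code.

```python
def business_score(doc, w_margin=0.5, w_popularity=0.3):
    """Relevance multiplied by a business boost; out-of-stock items sink."""
    if doc["inventory"] <= 0:
        return 0.0
    boost = 1.0 + w_margin * doc["margin"] + w_popularity * doc["popularity"]
    return doc["relevance"] * boost


def rerank(results, pinned_ids=()):
    """Pinned ("hardcoded") results first, the rest by business score."""
    pinned_set = set(pinned_ids)
    by_id = {d["id"]: d for d in results}
    pinned = [by_id[i] for i in pinned_ids if i in by_id]
    rest = [d for d in results if d["id"] not in pinned_set]
    rest.sort(key=business_score, reverse=True)
    return pinned + rest


# Hypothetical search results with made-up signal values.
RESULTS = [
    {"id": "A", "relevance": 2.0, "margin": 0.1, "popularity": 0.2, "inventory": 5},
    {"id": "B", "relevance": 1.8, "margin": 0.6, "popularity": 0.9, "inventory": 3},
    {"id": "C", "relevance": 3.0, "margin": 0.5, "popularity": 0.5, "inventory": 0},
]

if __name__ == "__main__":
    print([d["id"] for d in rerank(RESULTS)])
```

Here the most relevant item, C, drops to the bottom because it is out of stock, while B overtakes A on margin and popularity.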
  • 18. … Play Jeopardy!?
      Indeed, IBM Watson uses Lucene
      Critical component of Question Answering (QA) is often retrieval
      How to build a simple QA system?
        – Documents can be:
          • Whole text, paragraph, sentences
          • Position-based queries (spans) to find where keywords match
          • Index part of speech tags and possibly other analysis
        – Queries:
          • Classify based on Answer Type
          • Retrieve passages based on keywords plus answer type
          • Score passages!
      Chapter 9
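The query side of the simple QA pipeline (classify the answer type, retrieve passages by keywords plus answer type, score passages) can be sketched as follows. The answer-type rules, the crude regular expressions, the bonus of 2 for a passage containing the expected type, and the example passages are all hypothetical; a real system would use far richer analysis.

```python
import re

# Hypothetical answer-type rules keyed on the question word.
ANSWER_TYPES = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

# Deliberately crude patterns standing in for real entity analysis.
TYPE_PATTERNS = {
    "DATE": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "LOCATION": re.compile(r"\bin [A-Z][a-z]+\b"),
}

STOPWORDS = {"when", "who", "where", "did", "was", "is", "the", "a", "of", "in"}


def answer_type(question):
    """Classify the question by its first word; None if unknown."""
    words = question.lower().split()
    return ANSWER_TYPES.get(words[0]) if words else None


def score_passage(question, passage):
    """Keyword overlap plus a bonus if the passage contains the answer type."""
    keywords = set(re.findall(r"[a-z]+", question.lower())) - STOPWORDS
    terms = set(re.findall(r"[a-z]+", passage.lower()))
    overlap = len(keywords & terms)
    atype = answer_type(question)
    has_type = bool(atype and TYPE_PATTERNS[atype].search(passage))
    return overlap + (2 if has_type else 0)


def best_passage(question, passages):
    """Return the highest-scoring passage for the question."""
    return max(passages, key=lambda p: score_passage(question, p))


# Made-up passages for illustration only.
PASSAGES = [
    "The Twins play baseball in Minnesota.",
    "The Twins capture the World Series in 2031.",
    "A series of storms hit in 2031.",
]
```

The answer-type bonus is what lets a passage with a year outrank one that merely shares a keyword with a "when" question.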
  • 19. Stuff
  • 20. … Make you a Better Programmer?
      If your tests aren’t failing from time to time, are you really doing enough testing?
      We’ve introduced some serious randomized testing
        – We run randomized tests every 30 minutes, ad infinitum
        – Random locales, time zones, index file formats, and much, much more
        – Some in the community also randomize JVMs continuously
      We liked what we built so much, we now publish it as its own module
        – https://issues.apache.org/jira/browse/LUCENE-3492
        – https://github.com/carrotsearch/randomizedtesting
      More References at end of talk
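The core idea of randomized testing, random inputs on every run plus a reported seed so that any failure can be replayed, can be shown in a few lines. This is a generic sketch, not the API of the randomizedtesting module linked above; `run_randomized` and `check_sort_properties` are hypothetical names.

```python
import random
import time


def run_randomized(test, iterations=100, seed=None):
    """Run `test(rng)` repeatedly; on failure, report the reproducing seed."""
    master = seed if seed is not None else int(time.time() * 1000)
    for i in range(iterations):
        iteration_seed = master + i
        rng = random.Random(iteration_seed)
        try:
            test(rng)
        except AssertionError as exc:
            raise AssertionError(
                f"failed with seed={iteration_seed}: {exc}"
            ) from exc
    return master


def check_sort_properties(rng):
    """Example property check on randomly sized, randomly filled lists."""
    data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    once = sorted(data)
    assert sorted(once) == once
    assert len(once) == len(data)


if __name__ == "__main__":
    used = run_randomized(check_sort_properties)
    print(f"all iterations passed (master seed {used})")
```

Passing the reported seed back in through `seed=` replays exactly the inputs that failed, which is what makes "tests that fail from time to time" debuggable rather than flaky.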
  • 21. … Run Circles Around Previous Versions of Lucene?
      Finite State Transducers
      Pluggable Indexing Models
        – Codecs
      Pluggable Scoring Models
        – BM25, Information based, others
      http://bit.ly/dawid-weiss-lucene-rev
  • 22. Crazy Stuff
  • 23. … Play Chess?!? – THOUGHT EXPERIMENT
      Well, maybe not play, but could we help?
      Premise: Even though chess has a very large number of possibilities, most board positions have been played before
      Could you assist with real-time analysis?
        – Index large collection of previously played games
      Document A
        – Sequence of all moves of the game
        – Metadata
        – Query: PrefixQuery of current board + Function
        – Results: Ranked list of moves most likely to lead to a win
      Alternatives: index board positions, subsequences of moves (n-grams)
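The thought experiment can be mimicked on a toy scale: treat each game as a sequence of moves, select the games whose opening matches the moves played so far (the role of the PrefixQuery), and rank the candidate next moves by how often they led to a win for the side to move (the role of the function). The five games below are invented sample data, and `suggest_moves` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical mini-collection: (moves, result from White's point of view).
GAMES = [
    ("e4 e5 Nf3 Nc6 Bb5", "win"),
    ("e4 e5 Nf3 Nc6 Bc4", "loss"),
    ("e4 e5 Nf3 Nf6 Nxe5", "win"),
    ("e4 c5 Nf3 d6 d4", "win"),
    ("d4 d5 c4 e6 Nc3", "draw"),
]


def suggest_moves(played, games=GAMES):
    """Rank next moves by win rate for the side to move, given a prefix."""
    prefix = played.split()
    # White moves after an even number of plies; a White "loss" is a Black win.
    good_result = "win" if len(prefix) % 2 == 0 else "loss"
    stats = defaultdict(lambda: [0, 0])  # move -> [good results, games]
    for moves, result in games:
        seq = moves.split()
        if seq[:len(prefix)] == prefix and len(seq) > len(prefix):
            nxt = seq[len(prefix)]
            stats[nxt][1] += 1
            if result == good_result:
                stats[nxt][0] += 1
    ranked = sorted(
        stats.items(),
        key=lambda kv: (kv[1][0] / kv[1][1], kv[1][1]),
        reverse=True,
    )
    return [(move, good / total) for move, (good, total) in ranked]


if __name__ == "__main__":
    print(suggest_moves("e4 e5 Nf3 Nc6"))
```

Indexing n-grams of moves or whole board positions, as the slide suggests, would additionally catch positions reached through a different move order, which a pure prefix match misses.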
  • 24. What else?
      In case you haven’t noticed, Lucene can do a lot of things that are not “traditional search”
      I’d love to hear your use cases!
  • 25. Resources
      http://lucene.apache.org
      @gsingers / grant@lucidimagination.com
      http://www.lucidimagination.com
      http://lucene.grantingersoll.com
  • 26. References and Credits
      Unit Testing:
        – http://wiki.apache.org/lucene-java/RunningTests
        – Robert Muir: http://lucenerevolution.org/sites/default/files/test%20framework.pdf
        – Dawid Weiss’ Lucene Eurocon talk: http://bit.ly/vaxdUC
      Images:
        – Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
        – Storage: http://www.flickr.com/photos/d_e_/7641738/sizes/m/in/photostream/

Editor's Notes

  1. That’s the description of Lucene, but hey, it’s good for other things too. Let’s explore these. We’ll start easy, then get into things that are mathematically similar to search, and then talk some crazy stuff.
  2. Oh, BTW, it can do search over the values. Keys can be anything, not just strings.
  3. Commit/rollback not totally the same as DB
  4. Lucene is a perfectly good content-based recommendation engine. In fact, this can fall under the category of “search”. Lots of flexibility around representing features. http://www.lucidimagination.com/search/document/5485be0137448eca/problems_with_itembasedrecommender_with_lucene#c82c577e1e28259f
  5. You remembered your synonyms and associations, right? Maybe bootstrap from WordNet or another resource? Perhaps you even used Lucene to calculate co-occurrences. You can tweak the system as needed to come up with appropriate queries, etc.
  6. Let’s say you have a bunch of training data
  7. Pairwise similarity: compare all documents
  8. Scoring is easier said than done, but a simple approach can be effective for fact-based questions