Analyze This!


                              Tom Hill
                              Lucid Imagination
                              Webinar 1/28/2010



    Lucid Imagination, Inc.
Analyze This!




                 Analysis
         Basics, Tips and Tools



                                                    © 2010 Lucid Imagination, Inc.
Overview
         We’ll be covering:
           What is analysis, and why do you care?
           Some common problems with analysis
           Tools for troubleshooting
             Analyzer Tool
             Schema Browser
             Luke
           Existing Analyzers, Filters and Tokenizers




What is Analysis?

         • Converting your text into terms
              Solr does NOT search your text
              Solr searches the set of terms created by analysis
              Problems happen when the terms are not what you think they
              are




Examples

                       Don’t => dont

                       iPhone => i phone
                       iPhone => iphon
                       τα πρώτα δείγματα => πρωτα δειγματα
                       The quick brown fox jumps => The quick brown fox jumps



Different Effects of Analysis
                There are many ways to analyze a run of text.
                       Break on whitespace, punctuation, caseChanges, numb3rs
                       Stemming (shoes -> shoe)
                       Removing/replacing unwanted words/symbols
                       Combining words
                       Adding new words (synonyms)
                       And many more
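A minimal sketch of two of these effects, written in Python rather than Solr's Java. It breaks text on whitespace, punctuation, case changes and digits, then applies a deliberately naive plural-stripping step. The function names and the regular expression are illustrative assumptions, not Solr's word-delimiter or stemming filters, and the pattern only handles ASCII letters.

```python
import re

# A piece is: an optional capital followed by lowercase letters, a run of
# capitals, or a run of digits. Everything else acts as a break.
_PIECE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def split_words(text):
    """Break on whitespace, punctuation, case changes and digits; lowercase."""
    return [m.group(0).lower() for m in _PIECE.finditer(text)]

def naive_stem(term):
    """Toy stemmer (an assumption, not Porter): drop one trailing 's'."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term

def analyze(text):
    return [naive_stem(term) for term in split_words(text)]

print(analyze("caseChanges numb3rs"))  # ['case', 'change', 'numb', '3', 'rs']
print(analyze("shoes"))                # ['shoe']
```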


Copy Fields (1)


              It’s common to want to index data more than one way
              You might store an analyzed version of a field for searching
                And store an unanalyzed version for faceting or sorting
              You might store a stemmed and non-stemmed version of a field
                To boost precise matches




Copy Fields (2)


              It’s also common to copy to a common destination field
                For example: “alltext”
              Note this copies from the SOURCE of the copied field
                Not the analyzed version of the copied field
              <copyField source="cat" dest="text"/>
               <copyField source="name" dest="text"/>
               <copyField source="manu" dest="text"/>
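The "copies from the SOURCE" point can be illustrated with a toy model in Python. This is a sketch under assumptions, not Solr's implementation: the two analyzers, the field names and `index_document` are hypothetical. The raw value is copied first, and only then does each field run its own analyzer.

```python
def keyword(text):
    # Whole value kept as one term (useful for faceting or sorting).
    return [text]

def lowercase_whitespace(text):
    # Lowercase, then split on whitespace (useful for searching).
    return text.lower().split()

ANALYZERS = {"name": keyword, "text": lowercase_whitespace}
COPY_FIELDS = [("name", "text")]  # (source, dest) pairs

def index_document(doc):
    """Copy raw source values to their destinations, then analyze each field."""
    raw = {field: [value] for field, value in doc.items()}
    for source, dest in COPY_FIELDS:
        if source in doc:
            raw.setdefault(dest, []).append(doc[source])
    return {
        field: [term for value in values for term in ANALYZERS[field](value)]
        for field, values in raw.items()
    }

print(index_document({"name": "Apache Solr"}))
# {'name': ['Apache Solr'], 'text': ['apache', 'solr']}
```

The "text" field receives the original string and analyzes it itself; it never sees the single keyword term produced for "name".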

What could go wrong?

         • Lots of things
              You can’t find things
              You find too much
              Poor query or indexing performance




Common Scenario #1

              Someone sets up Solr for the first time
              Adds some data
              Then posts to the mailing list, and says “why can’t I find my
              data?”
              The problem’s basic, but it’s useful to know how to identify it.




“When I Search For ‘fox’…”




“…I Find Nothing”




“But, If I look at the index”




“It’s right there”




Analysis Tool

               Your first stop for figuring out analysis problems
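The tool's side-by-side view of indexed terms and query terms can be imitated in a few lines of Python. This is a sketch of one plausible cause of the "fox" scenario, assuming the field is untokenized; the screenshots are not reproduced here, so that cause is an assumption, and both analyzers are hypothetical.

```python
def untokenized(text):
    # Keeps the whole field value as a single term, with no lowercasing.
    return [text]

def tokenized(text):
    # Lowercases and splits on whitespace.
    return text.lower().split()

def finds(analyzer, document, query):
    """True if every query term appears among the document's indexed terms."""
    indexed_terms = set(analyzer(document))
    return all(term in indexed_terms for term in analyzer(query))

document = "The quick brown fox jumps"
print(finds(untokenized, document, "fox"))  # False
print(finds(tokenized, document, "fox"))    # True
```

The stored text contains "fox" in both cases; only the indexed terms differ.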




Analysis Tool




Analysis Tool Demo




Stored vs. Indexed

               Solr can store both analyzed and un-analyzed content
               But you knew that …
                 “stored” vs. “indexed” in the field definition
               How can you see what is actually indexed?
                 …that is, the terms you can search for




Schema Browser
              Schema Browser lets you examine the fields and how they are
              configured.
              It also allows you to examine the terms in the index




Schema Browser




Schema Browser




Schema Browser Demo




How Many of You Just Copied the Example Schema?

          • Just because it works for one person’s data, doesn’t mean it
            works for yours.
          • Take the time to look at the output




Luke

                 Lucene Index Exploration Tool
                 Allows you to look at (and modify) the contents of an index




Luke Main Screen




Luke Document “Reconstruction”




Luke Document “Reconstruction”




Close-up from last slide

               solr null_1 enterpris search server
               null_100 apach softwar foundat null_100 softwar null_100 search
                 null_100 advanc
               full fulltext|text search capabl use
               lucen null_100 optim null_1 high …




Position Increment Gap

               The null_xxx entries are how Luke represents the position
               increment gap between values of a multi-valued field.
               The example had
               <field name="text">Solr, the Enterprise Search Server</field>
               <field name="text">Apache Software Foundation</field>
               Using a position increment gap prevents phrase queries from
               matching across different values of a field
               Without the gap, “Server Apache” would be a valid phrase.
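A simplified Python model of the gap, assuming a lowercase/whitespace analysis and exact-adjacency phrase matching. The function names and the way the gap is added are illustrative assumptions; Lucene's bookkeeping differs in detail.

```python
def index_positions(values, gap=100):
    """Return {term: [positions]} for one multi-valued field."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # jump ahead between values
        for term in value.lower().replace(",", " ").split():
            pos += 1
            positions.setdefault(term, []).append(pos)
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )

values = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]
print(phrase_matches(index_positions(values, gap=100), "Server Apache"))  # False
print(phrase_matches(index_positions(values, gap=0), "Server Apache"))    # True
```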

Analysis Can Affect Performance

               Analysis doesn’t just produce success/failure on a search
               It can affect the query processing speed, too.




Slow Searches

              They index 500,000 books
              Multiple languages in one field
                So they can’t do stemming or stop words
              Their worst case query was:
              “The lives and literature of the beat generation”
              It took 2 minutes to run.
              The query requires checking every doc containing “the” & “and”
                And the position info for each occurrence




Bi-grams

              Bi-grams combine adjacent terms
              “The lives and literature” becomes
              “The lives” “lives and” “and literature”
              Only have to check documents that contain the pair adjacent to
              each other.
              Only have to look at position information for the pair
              But can triple the size of the index
                Word indexed by itself
                Indexed both with preceding term, and following term
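Forming the bi-grams themselves is a one-liner. A minimal Python sketch, assuming the tokens have already been produced by a tokenizer; the function name is hypothetical.

```python
def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

print(bigrams(["The", "lives", "and", "literature"]))
# ['The lives', 'lives and', 'and literature']
```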



Common Grams

              Form bi-grams only for common terms
              “The” occurs 2 billion times. “The lives” occurs 360k.
              Used only the 32 most common terms
              Average response went from 460 ms to 68 ms.
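A rough Python sketch of the idea: keep every single term, and add a pair only where at least one side is a common term. The three-word list, the underscore separator and the function name are assumptions for illustration; the real filter and the 32-term list from the slide are not reproduced here.

```python
COMMON = {"the", "and", "of"}  # hypothetical stand-in for the real list

def common_grams(tokens, common=COMMON):
    """Emit each token, plus a pair when either neighbor is a common term."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens) and (token in common or tokens[i + 1] in common):
            out.append(f"{token}_{tokens[i + 1]}")
    return out

terms = common_grams("the lives and literature of the beat generation".split())
print("the_lives" in terms)        # True
print("beat_generation" in terms)  # False
```

Pairs such as "beat generation" are skipped because neither word is common, which keeps index growth well below the full bi-gram case.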




Implied Phrase Queries

               Another example involved a query with “L’art”
               This turns into a phrase query, “L art”, with the default config.
                 PhraseQuery(text:"l art")
               Turning it into the single token ‘L art’ is much more efficient.
                 Occurs in far fewer documents than “L”
                 Is a term query, not a phrase query.
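The difference comes down to how many tokens the analyzer produces for one query word. A small Python sketch, assuming two hypothetical tokenizers: one that breaks on the apostrophe and one that keeps it inside the token.

```python
import re

def split_on_apostrophe(text):
    # Letters only: the apostrophe acts as a break.
    return re.findall(r"[a-z]+", text.lower())

def keep_apostrophe(text):
    # The apostrophe is treated as part of the token.
    return re.findall(r"[a-z']+", text.lower())

def query_kind(tokens):
    """More than one token from a single query word implies a phrase query."""
    return "phrase" if len(tokens) > 1 else "term"

print(split_on_apostrophe("L'art"), query_kind(split_on_apostrophe("L'art")))
# ['l', 'art'] phrase
print(keep_apostrophe("L'art"), query_kind(keep_apostrophe("L'art")))
# ["l'art"] term
```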



Multiple Languages

              Generally, we suggest keeping different languages in their own
              fields
              This lets you have an analyzer for each language
                Stemming, stop words, etc.
              If you don’t know the total number of languages, you can use
              dynamic fields.
                That allows you to accept them, but not to dynamically stem, etc.


Analysis And Query Parsing

               What happens when parsing a query in Solr?
                You may have many fields, with different analyzers
                Which Analyzer gets used?




Analysis And Query Parsing

               QueryParser splits the query
                 Understands quotes, parens and whitespace
               Gives the resulting pieces to the correct analyzer
                 Explicit or Default




Analysis And Query Parsing

               To see what happens to your query
                 Use the “Full Interface” section of the admin interface
                   Check ‘debug: enable’
                 Or just add “&debugQuery=on” to the end of your query string
               We’re using the Lucene Query Parser
               Dismax does different things.
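Adding the parameter from a script is straightforward. A Python sketch using only the standard library; the host, port and "/solr/select" path are assumptions about a local installation, so adjust them to your setup.

```python
from urllib.parse import urlencode

def debug_query_url(query, base="http://localhost:8983/solr/select"):
    """Build a search URL with debugQuery=on appended (base URL is assumed)."""
    return base + "?" + urlencode({"q": query, "debugQuery": "on"})

print(debug_query_url("title:foo bar"))
# http://localhost:8983/solr/select?q=title%3Afoo+bar&debugQuery=on
```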


Seeing the results of query parsing




Seeing the results of query parsing




Query Examples

              title:foo bar
                Becomes: +title:foo +text:bar
                “foo” goes to the title field’s analyzer, “bar” to the default field’s analyzer
              manu:"foo_bar baz"
                Becomes: manu:"foo bar baz"
                Note that the _ got removed. The whole string goes to the manu analyzer
                Phrase query
              title: (foo bar)
                Becomes: title:foo title:bar
                foo and bar are passed separately to title’s analyzer
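The splitting step in these examples can be approximated with a toy parser in Python. This is a sketch, not the Lucene QueryParser: it only understands an optional field prefix, quotes, parentheses and whitespace, and it assumes "text" is the default field. It shows which field's analyzer each piece would be handed to; the analysis itself (such as removing the underscore) is not modeled.

```python
import re

# Optional "field:" prefix, then a quoted phrase, a parenthesized group,
# or a bare word.
_CLAUSE = re.compile(r'(?:(\w+):\s*)?(?:"([^"]*)"|\(([^)]*)\)|(\S+))')

def split_query(query, default_field="text"):
    """Return (field, text, is_phrase) for each piece sent to an analyzer."""
    pieces = []
    for field, phrase, group, word in _CLAUSE.findall(query):
        field = field or default_field
        if phrase:
            pieces.append((field, phrase, True))   # whole string, one analyzer call
        elif group:
            pieces.extend((field, w, False) for w in group.split())
        elif word:
            pieces.append((field, word, False))
    return pieces

print(split_query('title:foo bar'))
# [('title', 'foo', False), ('text', 'bar', False)]
print(split_query('manu:"foo_bar baz"'))
# [('manu', 'foo_bar baz', True)]
print(split_query('title: (foo bar)'))
# [('title', 'foo', False), ('title', 'bar', False)]
```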

Components of an Analyzer




Components of an Analyzer

              CharFilters
              Tokenizers
              TokenFilters




CharFilters

               Used to clean up/regularize characters before passing to the
               Tokenizer
               Remove accents, etc.: MappingCharFilter
               They can also do complex things; we’ll look at
               HTMLStripCharFilter later.




Tokenizers

               Convert text to tokens (terms)
               Only one per analyzer
               Many Options
                 WhitespaceTokenizer
                 StandardTokenizer
                 PatternTokenizer
                 More…

TokenFilters

               Process the tokens produced by the Tokenizer
               Can be many of them per field
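How the three component types chain together can be sketched in Python: any number of char filters, exactly one tokenizer, then any number of token filters. All function names and the stop-word list here are hypothetical stand-ins, not Solr classes.

```python
import unicodedata

def strip_accents(text):
    # Char filter: works on raw characters, before tokenization.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def whitespace_tokenizer(text):
    # Tokenizer: exactly one per analyzer.
    return text.split()

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a", "an"})):
    return [token for token in tokens if token not in stop_words]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then each token filter in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The Café Olé", [strip_accents], whitespace_tokenizer,
              [lowercase_filter, stop_filter]))
# ['cafe', 'ole']
```

Order matters: swapping the two token filters would let "The" slip past the stop list, because it would not yet be lowercased.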




Some example TokenFilters that come with Solr/Lucene

               There are way too many to list them all
               We’re just going to go through a few of them




Reversing Filter

               Why?
                 Leading wildcards require traversing the whole index
               Reverse the order, and leading wildcards become trailing
                 *cats => stac*
               Only have to check terms that start with stac, instead of the
               whole index.
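The trick can be shown with a sorted list standing in for the term dictionary. A Python sketch under that assumption; the function names are hypothetical and this is not how Solr stores its reversed terms.

```python
import bisect

def build_reversed_index(terms):
    """Store every term reversed, in sorted order."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_index, suffix):
    """Answer *suffix by scanning only entries that start with the reversed suffix."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_index, prefix)
    matches = []
    for reversed_term in reversed_index[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: nothing later can match
        matches.append(reversed_term[::-1])
    return matches

index = build_reversed_index(["cats", "bobcats", "dogs", "cat"])
print(leading_wildcard(index, "cats"))  # ['cats', 'bobcats']
```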



Phonetic Analysis

               Creates a phonetic representation of the text, for “sounds like”
               matching
               PhoneticFilterFactory. Uses one of
                 Metaphone
                 Double Metaphone
                 Soundex
                 Refined Soundex
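To give a feel for what a phonetic encoder does, here is a compact Python version of classic Soundex. It is an independent sketch written for illustration, not the encoder the filter uses, and the function name is hypothetical.

```python
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Classic four-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset it, so a repeat after a vowel counts
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))    # R163 R163
print(soundex("theater"), soundex("theatre"))  # T360 T360
```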

Synonyms

              Synonym filter allows you to include alternate words that the
              user can use when searching
              For example, theater, theatre
                Useful for movie titles, where words are deliberately mis-spelled
              Don’t over-use synonyms
                It helps recall, but lowers precision
              Produces tokens at the same token position
                “local theater company”
                       theatre
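"Same token position" can be made concrete with a short Python sketch. The synonym map and function name are hypothetical; the point is only that the alternate word shares a position with the original, so phrase queries work with either spelling.

```python
SYNONYMS = {"theater": ["theatre"]}  # hypothetical synonym map

def with_synonyms(tokens, synonyms=SYNONYMS):
    """Return (term, position) pairs; synonyms reuse the original's position."""
    out = []
    for position, token in enumerate(tokens):
        out.append((token, position))
        for alternate in synonyms.get(token, []):
            out.append((alternate, position))
    return out

print(with_synonyms(["local", "theater", "company"]))
# [('local', 0), ('theater', 1), ('theatre', 1), ('company', 2)]
```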




HTML text extraction

               Removes HTML tags, attributes, comments
               Removes XML processing instructions
               Removes <script> and <style> contents
               Replaces entities
               HTMLStripCharFilterFactory
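The same clean-up steps can be imitated with Python's standard-library HTML parser. This is a sketch of the behavior listed above, not a port of the Solr char filter; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text, skipping tags, comments and <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())

print(strip_html('<p class="x">Fish &amp; chips<!-- note --></p>'
                 '<script>var a = 1;</script>'))
# Fish & chips
```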




Spell Checking

               Spell checker starts by analyzing the source terms into n-grams
               From the Lucene Wiki:




Spell Checking

               You don’t actually have to know that to use the spell checker
               But I think it’s kind of cool
               Use Luke to explore the index generated by the spell checker.
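For a feel of what those n-grams look like, here is a small Python sketch that cuts a word into character n-grams and notes the first and last one. The dictionary keys are made up for illustration and are not the field names the spell checker actually uses.

```python
def spelling_grams(word, n=3):
    """Character n-grams of a word, plus its leading and trailing gram."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return {
        "grams": grams,
        "start": grams[0] if grams else "",
        "end": grams[-1] if grams else "",
    }

print(spelling_grams("lucene"))
# {'grams': ['luc', 'uce', 'cen', 'ene'], 'start': 'luc', 'end': 'ene'}
```

A misspelling such as "lucen" still shares most of these grams with "lucene", which is what lets it be found as a candidate correction.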




And many more

              Regular expression Tokenizer
              Stemmers for many languages
               Persian, Hindi, Chinese, Japanese, etc.
               Third party/commercial stemmers available, too.
              SnowballPorterFilter




Recap

              If you can’t find it, and you are sure it’s there:
                  It’s likely an analysis problem
              Three main tools for troubleshooting analysis
                  Analysis tool
                  Schema browser
                  Luke
              Look at your index, documents and the output of your analyzers
              periodically.




Additional Resources

               Lucid Imagination Solr Reference Guide
                 LucidImagination.com/downloads
               Lucene in Action Second Edition
                 This isn’t published yet, but you can get the early access version
                 from manning.com/hatcher3
               http://www.hathitrust.org/blog
               Solr wiki on Analysis
                 Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
               Luke - http://code.google.com/p/luke/



Questions

              If we have time, we’ll take some questions




Thanks!
                Tom Hill
          LucidImagination.com




Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarLucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Analyze This! Tips and tricks on getting the Lucene/Solr analyzer to index and search your content right

  • 1. Analyze This! Tom Hill Lucid Imagination Webinar 1/28/2010 Lucid Imagination, Inc.
  • 2. Analyze This! Analysis Basics, Tips and Tools Lucid Imagination, Inc. Page 2 © 2010 Lucid Imagination, Inc.
  • 3. Overview. We’ll be covering: What is analysis, and why do you care? Some common problems with analysis. Tools for troubleshooting: Analyzer Tool, Schema Browser, Luke. Existing Analyzers, Filters and Tokenizers.
  • 4. What is Analysis? Converting your text into terms. Solr does NOT search your text. Solr searches the set of terms created by analysis. Problems happen when the terms are not what you think they are.
  • 5. Examples. Don’t => dont; iPhone => i phone, iphon; τα πρώτα δείγματα => πρωτα δειγματα; The quick brown fox jumps => The quick brown fox jumps.
  • 6. Different Effects of Analysis. There are many ways to analyze a run of text: break on whitespace, punctuation, caseChanges, numb3rs; stemming (shoes -> shoe); removing/replacing unwanted words/symbols; combining words; adding new words (synonyms); and many more.
  • 7. Copy Fields 1. It’s common to want to index data more than one way. You might store an analyzed version of a field for searching, and an unanalyzed version for faceting or sorting. You might store a stemmed and a non-stemmed version of a field, to boost precise matches.
  • 8. Copy Fields 2. It’s also common to copy to a common destination field, for example “alltext”. Note this copies from the SOURCE of the copied field, not the analyzed version of the copied field. <copyField source="cat" dest="text"/> <copyField source="name" dest="text"/> <copyField source="manu" dest="text"/>
  • 9. What could go wrong? Lots of things: you can’t find things; you find too much; poor query or indexing performance.
  • 10. Common Scenario #1. Someone sets up Solr for the first time, adds some data, then posts to the mailing list and says “why can’t I find my data?” The problem’s basic, but it’s useful to know how to identify it.
  • 11. “When I Search For ‘fox’…”
  • 12. “…I Find Nothing”
  • 13. “But, If I look at the index”
  • 14. “It’s right there”
  • 15. Analysis Tool. Your first stop for figuring out analysis problems.
  • 16. Analysis Tool
  • 17. Analysis Tool Demo
  • 18. Stored vs. Indexed. Solr can store both analyzed and un-analyzed content. But you knew that … “stored” vs. “indexed” in the field definition. How can you see what is actually indexed? …that is, the terms you can search for.
  • 19. Schema Browser. Schema Browser lets you examine the fields and how they are configured. It also allows you to examine the terms in the index.
  • 20. Schema Browser
  • 21. Schema Browser
  • 22. Schema Browser Demo
  • 23. How Many of You Just Copied the Example Schema? Just because it works for one person’s data doesn’t mean it works for yours. Take the time to look at the output.
  • 24. Luke. Lucene Index Exploration Tool. Allows you to look at (and modify) the contents of an index.
  • 25. Luke Main Screen
  • 26. Luke Document “Reconstruction”
  • 27. Luke Document “Reconstruction”
  • 28. Close-up from last slide: solr null_1 enterpris search server null_100 apach softwar foundat null_100 softwar null_100 search null_100 advanc full fulltext|text search capabl use lucen null_100 optim null_1 high …
  • 29. Position Increment Gap. The null_xxx entries are how Luke represents the position increment between instances of multi-valued fields. The example had <field name="text">Solr, the Enterprise Search Server</field> <field name="text">Apache Software Foundation</field>. Using a position increment gap prevents phrase queries from matching across different values of a field. Without the gap, “Server Apache” would be a valid phrase.
  • 30. Analysis Can Affect Performance. Analysis doesn’t just produce success/failure on a search. It can affect the query processing speed, too.
  • 31. Slow Searches. They index 500,000 books. Multiple languages in one field, so they can’t do stemming or stop words. Their worst case query was: “The lives and literature of the beat generation”. It took 2 minutes to run. The query requires checking every doc containing “the” & “and”, and the position info for each occurrence.
  • 32. Bi-grams. Bi-grams combine adjacent terms. “The lives and literature” becomes “The lives” “lives and” “and literature”. Only have to check documents that contain the pair adjacent to each other. Only have to look at position information for the pair. But this can triple the size of the index: each word is indexed by itself, and also indexed both with the preceding term and with the following term.
  • 33. Common Grams. Form bi-grams only for common terms. “The” occurs 2 billion times. “The lives” occurs 360k. Used only the 32 most common terms. Average response went from 460 ms to 68 ms.
  • 34. Implied Phrase Queries. Another example involved a query with “L’art”. This turns into a phrase query, “L art”, with the default config: PhraseQuery(text:"l art"). Turning it into the single token ‘L art’ is much more efficient. It occurs in far fewer documents than “L”. It is a term query, not a phrase query.
  • 35. Multiple Languages. Generally, we suggest keeping different languages in their own fields. This lets you have an analyzer for each language: stemming, stop words, etc. If you don’t know the total number of languages, you can use dynamic fields. That allows you to accept them, but not to dynamically stem, etc.
  • 36. Analysis And Query Parsing. What happens when parsing a query in Solr? You may have many fields, with different analyzers. Which Analyzer gets used?
  • 37. Analysis And Query Parsing. QueryParser splits the query. Understands quotes, parens and whitespace. Gives the resulting pieces to the correct analyzer: explicit or default.
  • 38. Analysis And Query Parsing. To see what happens to your query: use the “Full Interface” section of the admin interface and check ‘debug: enable’, or just add “&debugQuery=on” to the end of your query string. We’re using the Lucene Query Parser. Dismax does different things.
  • 39. Seeing the results of query parsing
  • 40. Seeing the results of query parsing
  • 41. Query Examples. title:foo bar becomes: +title:foo +text:bar. “foo” goes to the title field analyzer, bar to the default field analyzer. manu:"foo_bar baz" becomes: manu:"foo bar baz". Note _ got removed. The whole string goes to the manu analyzer. Phrase query. title: (foo bar) becomes: title:foo title:bar. foo and bar are passed separately to title’s analyzer.
  • 42. Components of an Analyzer
  • 43. Components of an Analyzer: CharFilters, Tokenizers, TokenFilters.
  • 44. CharFilters. Used to clean up/regularize characters before passing to the Tokenizer. Remove accents, etc. MappingCharFilter. They can also do complex things; we’ll look at HTMLStripCharFilter later.
  • 45. Tokenizers. Convert text to tokens (terms). Only one per analyzer. Many options: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, more…
  • 46. TokenFilters. Process the tokens produced by the Tokenizer. Can be many of them per field.
  • 47. Some example TokenFilters that come with Solr/Lucene. There are way too many to list them all. We’re just going to go through a few of them.
  • 48. Reversing Filter. Why? Leading wildcards require traversing the whole index. Reverse the order, and leading wildcards become trailing: *cats => stac*. Only have to check terms that start with stac, instead of the whole index.
  • 49. Phonetic Analysis. Creates a phonetic representation of the text, for “sounds like” matching. PhoneticFilterFactory uses one of: Metaphone, Double Metaphone, Soundex, Refined Soundex.
  • 50. Synonyms. Synonym filter allows you to include alternate words that the user can use when searching. For example: theater, theatre. Useful for movie titles, where words are deliberately misspelled. Don’t over-use synonyms: it helps recall, but lowers precision. Produces tokens at the same token position: in “local theater company”, “theatre” is added at the position of “theater”.
  • 51. HTML text extraction. Removes HTML tags, attributes, comments, XML processing directives. Removes <script> and <style> contents. Replaces entities. HTMLStripCharFilterFactory.
  • 52. Spell Checking. Spell checker starts by analyzing the source terms into n-grams. From the Lucene Wiki:
  • 53. Spell Checking. You don’t actually have to know that to use the spell checker. But I think it’s kind of cool. Use Luke to explore the index generated by the spell checker.
  • 54. And many more. Regular expression Tokenizer. Stemmers for many languages: Persian, Hindi, Chinese, Japanese, etc. Third party/commercial stemmers available, too. SnowballPorterFilter.
  • 55. Recap. If you can’t find it, and you are sure it’s there, it’s likely an analysis problem. Three main tools for troubleshooting analysis: Analysis tool, Schema browser, Luke. Look at your index, documents and the output of your analyzers periodically.
  • 56. Additional Resources. Lucid Imagination Solr Reference Guide: LucidImagination.com/downloads. Lucene in Action, Second Edition: this isn’t published yet, but you can get the early access version from manning.com/hatcher3. http://www.hathitrust.org/blog. Solr wiki on Analysis: Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. Luke: http://code.google.com/p/luke/
  • 57. Questions. If we have time, we’ll take some questions.
  • 58. Thanks! Tom Hill, LucidImagination.com
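
The difference between plain bi-grams (slide 32) and common grams (slide 33) can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's filter implementation: the function names are made up for this example, and `COMMON_TERMS` is a hypothetical three-word stand-in for the 32 most common terms, which the slides do not list.

```python
from typing import List, Set

# Hypothetical stand-in for the list of very frequent terms.
# The slides mention 32 such terms but do not say which ones.
COMMON_TERMS: Set[str] = {"the", "and", "of"}


def bigrams(tokens: List[str]) -> List[str]:
    """Pair every token with its right-hand neighbour (slide 32)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def common_grams(tokens: List[str], common: Set[str] = COMMON_TERMS) -> List[str]:
    """Keep every single token, and add a pair only when at least one
    side of the pair is a common term (the idea on slide 33)."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(f"{tok} {nxt}")
    return out


if __name__ == "__main__":
    tokens = "the lives and literature of the beat generation".split()
    print(bigrams(tokens))
    print(common_grams(tokens))
```

In this sketch "beat generation" never becomes a pair, because neither word is in the common list, while "the lives" does. That is why the index grows far less than with full bi-grams, yet a phrase query involving "the" can still look up the much rarer pair term.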