Information Access with Apache Lucene

Pierpaolo Basile, PhD
Research Assistant, University of Bari Aldo Moro
co-founder, QuestionCube s.r.l.
pierpaolo.basile@{uniba.it, questioncube.com}

CODE CAMP: 20 JULY - 1 AUGUST 2012
S. VITO DEI NORMANNI (BR)
Outline
• Search Engine
  – Vector Space Model
  – Inverted Index
• Apache Lucene
  – Indexing
  – Searching
• Advanced topics
  – Relevance feedback
  – Document similarity
  – Apache Tika
SEARCH ENGINE
Searching

Collection of documents + user query → relevant documents
Information Retrieval

Information Retrieval (IR) is the area of study
concerned with searching for documents,
for information within documents, and
for metadata about documents, as well as that of
searching structured storage, relational databases,
and the World Wide Web.

                                         Wikipedia
Information Retrieval Process

• Indexing: documents go through text operations to produce a logical view, which is indexed into an inverted index
• Querying: the user query, entered through the user interface, goes through the same text operations and then query operations
• Searching and ranking: the query is run against the index, the retrieved documents are ranked, and the ranked documents are shown to the user, who may provide feedback
Information Retrieval Model
<D, Q, F, R(qi, dj)>
• D: document representation
• Q: query representation
• F: query/document representation framework
• R(qi, dj): ranking function
Bag-of-words representation
Document/query as unordered collection of words

John likes to watch movies. Mary likes too. John also likes to
watch football games.



 {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6,
 "football": 7, "games": 8, "Mary": 9, "too": 10}
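As an illustration, the dictionary mapping above can be built in plain Java (a sketch, not part of Lucene; the `BagOfWords` class and method names are hypothetical). `termIds` assigns identifiers in order of first appearance, while `termCounts` produces the actual bag-of-words counts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWords {
    // Assigns an increasing ID to each distinct term, in order of first appearance.
    public static Map<String, Integer> termIds(String text) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty() && !ids.containsKey(token)) {
                ids.put(token, ids.size() + 1);
            }
        }
        return ids;
    }

    // Counts term occurrences: the actual bag-of-words representation.
    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```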
Vector Space Model
Query/Document Framework
• Algebraic model
• Documents and queries are represented as vectors in a geometric space
• Terms are the vector components
Vector Space Model

d_j = (w_{1,j}, w_{2,j}, …, w_{t,j})
q = (w_{1,q}, w_{2,q}, …, w_{t,q})

Each vector component corresponds to a term.

The component value w_{i,j} is the weight of the term t_i in the document d_j.
Vector Space Model
dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)

dj, q1, and q2 plotted as vectors in the (hit, refresh) plane.
Vector Space Model
Ranking function
• Vector similarity between the query and the document is computed by the cosine of the angle between them

R(d_j, q) = cos θ = (d_j · q) / (|d_j| |q|)
Vector Space Model
dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)

In the (hit, refresh) plane, θ1 is the angle between q1 and dj and θ2 is the angle between q2 and dj; the smaller angle (θ2) means higher similarity.
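The cosine ranking above can be sketched in plain Java over sparse term-weight vectors (a minimal sketch, not Lucene code; the `CosineSimilarity` class name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // cos(theta) = (a . b) / (|a| * |b|) over sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean length of a sparse vector.
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }
}
```

With dj = (hit:5, refresh:2), the query q2 = (hit:1) forms a smaller angle with dj than q1 = (refresh:1), so it scores higher.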
Term-Document matrix

D1:
 • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
 • ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice.

D2:
 • Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate.’
 • Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.

D3:
 • In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her.
 • Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
Term-Document matrix
                 D1      D2      D3
Cheshire Cat     1       0       1
Alice            2       2       2
book             1       0       0
King             1       1       0
table            0       1       1
Queen            0       1       1
grin             0       0       2
Term-Document matrix
                  D1        D2   D3
Cheshire Cat      1         0    1
Alice             2         2    2
book              1         0    0
King              1         1    0
table             0         1    1
Queen             0         1    1
grin              0         0    2

Query: Alice AND Queen

Query vector over the terms above: (0, 1, 0, 0, 0, 1, 0)
Inverted Index
Inverted Index
Dictionary     Posting List
Alice        D1:2 D2:2 D3:2
…
book         D1:1
…
Cheshire Cat D1:1 D3:1
…
grin         D3:2
…
King         D1:1 D2:1
…
Queen        D2:1 D3:1
…
table        D2:1 D3:1
…
Inverted Index
Dictionary     Posting List                   Position List
Alice        D1:2 [14, 49]  D2:2 [1, 40]  D3:2 [18, 34]
…
book         D1:1 [33]
…
Cheshire Cat D1:1 D3:1
…
grin         D3:2
…
King         D1:1 D2:1
…
Queen        D2:1 D3:1
…
table        D2:1 D3:1
…

The bracketed numbers are the term positions within each document (e.g. Alice occurs at positions 14 and 49 in D1).
Inverted Index: query processing

AND (intersection)

<Alice>  D1:2 D2:2 D3:2
<King>   D1:1 D2:1

Alice AND King -> (D1, D2)
Inverted Index: query processing

OR (union)

<grin>  D3:2
<King>  D1:1 D2:1

grin OR King -> (D1, D2, D3)
Inverted Index: query processing

NOT (difference)

<Alice>  D1:2 D2:2 D3:2
<grin>   D3:2

Alice NOT grin -> (D1, D2)
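The three Boolean operations above can be sketched over sorted posting lists of document IDs in plain Java (a sketch, not Lucene code; the `PostingOps` class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingOps {
    // AND: intersection of two sorted posting lists.
    public static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    // OR: union, keeping each document once.
    public static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j >= b.size() || (i < a.size() && a.get(i) < b.get(j))) out.add(a.get(i++));
            else if (i >= a.size() || b.get(j) < a.get(i)) out.add(b.get(j++));
            else { out.add(a.get(i)); i++; j++; }
        }
        return out;
    }

    // NOT: documents in a that do not appear in b.
    public static List<Integer> not(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int d : a) {
            while (j < b.size() && b.get(j) < d) j++;
            if (j >= b.size() || b.get(j) != d) out.add(d);
        }
        return out;
    }
}
```

With Alice = {1, 2, 3}, King = {1, 2}, grin = {3}: Alice AND King gives {1, 2}, grin OR King gives {1, 2, 3}, Alice NOT grin gives {1, 2}.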
Term-weight
• Measures the term relevance in a document
  – component value in the document representation
  – TF*IDF
     • TF (term frequency): term occurrences in the document
     • IDF (inverse document frequency): inverse of the number of documents in which the term occurs

tf*idf(t, d) = tf(t, d) * log( |D| / |{d' ∈ D : t ∈ d'}| )

where |D| is the number of documents in the collection and the log factor is the idf.
TF*IDF insights
• increases proportionally with the term frequency in the document
• decreases with the number of documents in which the term occurs
  – common words are generally more frequent across the collection
• IDF depends on the collection, TF on the document
Inverted index/TF*IDF
• TF: read from the term occurrences stored in the posting list
• IDF: computed from the posting list cardinality

  Alice   D1:2 D2:2 D3:2   |D| = 100

  tf*idf(Alice, D1) = 2 * log(100 / 3)

(TF = 2 from the D1 entry; IDF = log(100/3) because Alice appears in 3 of the 100 documents.)
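The computation above is a one-liner in plain Java (a sketch of the textbook formula; note Lucene's own TFIDFSimilarity uses slightly different tf and idf variants):

```java
public class TfIdf {
    // Textbook weighting: tf*idf(t, d) = tf(t, d) * log(|D| / df(t)).
    // termFreq: occurrences of t in d; docFreq: documents containing t; numDocs: |D|.
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }
}
```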
http://lucene.apache.org
APACHE LUCENE
Apache Lucene
• Apache Software Foundation project
  – http://lucene.apache.org
• What Lucene is
  – Search library: indexing and searching Application
    Programming Interface (API)
• What Lucene is not
  – Search engine (no crawling, server, user interface,
    etc.)
Lucene
•   Full-text search
•   High performance
•   Scalable
•   Cross-platform
•   100%-pure Java
•   Porting in other languages: .NET, Python
Information Retrieval Process

The same process mapped onto org.apache.lucene classes:
• Text operations → Analyzer / Tokenizer / Filter
• Indexing → IndexWriter, which builds the inverted index
• Query operations → QueryParser
• Searching → IndexSearcher over an IndexReader
• Ranking → Similarity
Lucene Model 1/2
• Based on the Vector Space Model
• Features:
  –   multi-field documents
  –   tf (term frequency): number of matching terms in the field
  –   lengthNorm: number of tokens in the field
  –   idf: inverse document frequency
  –   coord: coordination factor, number of matching query terms
• Boosts:
  – field boost
  – query clause boost
Lucene Model 2/2

score(q, d) = coord(q, d) * queryNorm(q) * Σ_{t in q} ( tf(t, d) * idf(t)² * boost(t) * norm(t, d) )

• coord(q,d): score factor based on how many of the query terms are found in the specified document
• queryNorm(q): a normalizing factor used to make scores between queries comparable
• boost(t): term boost
• norm(t,d): encapsulates a few (indexing-time) boost and length factors
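The shape of this scoring function can be sketched in plain Java (a simplified sketch of the classic Lucene 3.x formula, not the actual implementation; the `LuceneScoreSketch` class name is hypothetical, and the per-term factors are passed in precomputed):

```java
public class LuceneScoreSketch {
    // score(q,d) = coord(q,d) * queryNorm(q) * sum_t tf(t,d) * idf(t)^2 * boost(t) * norm(t,d)
    // Arrays are indexed by query term; all factors are assumed precomputed.
    public static double score(double coord, double queryNorm,
                               double[] tf, double[] idf, double[] boost, double[] norm) {
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            sum += tf[i] * idf[i] * idf[i] * boost[i] * norm[i];
        }
        return coord * queryNorm * sum;
    }
}
```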
Document Indexing
FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, config);
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
Field Options
• Field.Store
  – NO: field value is not stored
  – YES: field value is stored
• Field.Index
  – ANALYZED: field is analyzed by the analyzer
  – NO: not indexed
  – NOT_ANALYZED: indexed but not analyzed
Luke
• Lucene diagnostic tool: http://code.google.com/p/luke/
Document: Fields Representation
http://www.bbc.co.uk/news/technology-15365207

• URL: Field.Store.YES, Field.Index.NO
• Date: Field.Store.YES, Field.Index.NOT_ANALYZED
• Authors: Field.Store.YES, Field.Index.ANALYZED
• Title: Field.Store.NO, Field.Index.ANALYZED
• Abstract: Field.Store.NO, Field.Index.ANALYZED
• Content: Field.Store.NO, Field.Index.ANALYZED
• Comments: Field.Store.NO, Field.Index.ANALYZED
Text Operations
• Tokenization
  – split the text into tokens
• Stop word elimination
  – remove common words and closed-class words (e.g. function words)
• Stemming
  – reduce inflected (or sometimes derived) words to their stem
Analyzers
• KeywordAnalyzer
   – Returns a single token
• WhitespaceAnalyzer
   – Splits tokens at whitespace
• SimpleAnalyzer
   – Divides text at nonletter characters and lowercases
• StopAnalyzer
   – Divides text at non letter characters, lowercases, and removes
     stop words
• StandardAnalyzer
   – Tokenizes based on sophisticated grammar that recognizes e-
     mail addresses, acronyms, etc.; lowercases and removes stop
     words (optional)
Analyzers
    The quick brown fox jumped over the lazy dogs
•   WhitespaceAnalyzer:
    [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
•   SimpleAnalyzer:
    [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
•   StopAnalyzer:
    [quick] [brown] [fox] [jumped] [lazy] [dogs]
•   StandardAnalyzer:
    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
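The behavior of the simpler analyzers can be approximated in plain Java (a sketch that mimics SimpleAnalyzer- and StopAnalyzer-style processing; it is not the Lucene analysis chain, and the `AnalyzerSketch` class and the stop-word set used below are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // SimpleAnalyzer-like: divide text at non-letter characters and lowercase.
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^a-zA-Z]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    // StopAnalyzer-like: simple() plus removal of the given stop words.
    public static List<String> stop(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : simple(text)) {
            if (!stopWords.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```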
Analyzers
  "XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer:
  [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer:
  [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer:
  [xy&z] [corporation] [xyz@example.com]
Tokenizers
• A Tokenizer is a TokenStream whose input is a Reader
• Tokenizers break field text into tokens
• StandardTokenizer
   – source string: “full-text lucene.apache.org”
   – “full” “text” “lucene.apache.org”
• WhitespaceTokenizer
   – “full-text” “lucene.apache.org”
• LetterTokenizer
   – “full” “text” “lucene” “apache” “org”
TokenFilters
• A TokenFilter is a TokenStream whose input is
  another token stream
• LowerCaseFilter
• StopFilter
• LengthFilter
• PorterStemFilter
   – stemming: reducing words to root form
   – rides, ride, riding => ride
   – country, countries => countri
Analyzer
public class MyAnalyzer2 extends Analyzer {
private Set sw = StopFilter.makeStopSet(Version.LUCENE_36, new
    ArrayList(StopAnalyzer.ENGLISH_STOP_WORDS_SET));


    @Override
    public TokenStream tokenStream(String string, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, sw);
        return ts;
    }
}
TermVectors
• Store information about term frequency,
  positions and offsets
• Relevance Feedback and “More Like This”
• Clustering
• Similarity between two documents
• Highlighter
  – needs offsets info
Lucene TermVector
• TermFreqVector is a representation of all of the
  terms and term counts in a specific Field of a
  Document instance
• As a tuple:
   – termFreq = <term, term count>
   – <fieldName, <…, termFreq_i, termFreq_i+1, …>>
• As Java:
   – public String getField();
   – public String[] getTerms();
   – public int[] getTermFrequencies()
Lucene TermPositionVector
• TermPositionVector is a representation of all
  of the terms, term counts, term position and term
  offsets in a specific Field of a Document instance
• As Java:
   –   public String getField();
   –   public String[] getTerms();
   –   public int[] getTermFrequencies()
   –   public int[] getTermPositions(int index)
   –   public TermVectorOffsetInfo[]
       getOffsets(int index)
Store TermVector
• During indexing, create a Field that stores term vectors:
   – new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
• Options are:
   –   Field.TermVector.YES
   –   Field.TermVector.NO
   –   Field.TermVector.WITH_POSITIONS
   –   Field.TermVector.WITH_OFFSETS
   –   Field.TermVector.WITH_POSITIONS_OFFSETS
Term Vector: Positions and Offsets


"The quick brown fox jumped over the lazy dogs"

Token     Position   Offsets
The       0          0-3
quick     1          4-9
brown     2          10-15
fox       3          16-19
jumped    4          20-26
over      5          27-31
the       6          32-35
lazy      7          36-40
dogs      8          41-45
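The positions and offsets above can be computed with a simple whitespace tokenizer in plain Java (a sketch of the concept, not Lucene's TokenStream machinery; the `OffsetsSketch` and `Token` names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetsSketch {
    public static class Token {
        public final String text;
        public final int position;    // token index in the stream
        public final int startOffset; // character offset of the first character
        public final int endOffset;   // character offset after the last character
        public Token(String text, int position, int start, int end) {
            this.text = text; this.position = position;
            this.startOffset = start; this.endOffset = end;
        }
    }

    // Whitespace tokenization that records positions and character offsets.
    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int pos = 0;
        while (m.find()) {
            tokens.add(new Token(m.group(), pos++, m.start(), m.end()));
        }
        return tokens;
    }
}
```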
Example
• Main method with three parameters
   – Input_dir: directory to index
   – Index_dir: directory containing the index files
   – Flag (string) corresponding to different
     Field.TermVector... options
      •   t: tokenize
      •   tv: term vectors
      •   tvp: term vectors with positions
      •   tvo: term vectors with positions and offsets
• The constructor of the Index class takes the three parameters as input and indexes the directory
   – filters only “.txt” files
Lucene Search
• IndexReader: interface for accessing an index
• QueryParser: parses the user query
• IndexSearcher: implements search over a
  single IndexReader
  – search(Query query, int numDoc)
  – TopDocs -> result of search
     • TopDocs.scoreDocs returns an array of retrieved
       documents
Lucene Search
Directory dir = FSDirectory.open(new File(args[0]));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query query = parser.parse(args[1]);
TopDocs topDocs = searcher.search(query, 100);
System.out.println("matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = reader.document(topDocs.scoreDocs[i].doc);
    System.out.println("name: " + doc.get("name") + ", score: " +
      topDocs.scoreDocs[i].score);
    System.out.println();
}
searcher.close();
Lucene QueryParser
• Example:
  – queryParser.parse("name:SpiderMan");
• good for human-entered queries and debugging
• does text analysis and constructs appropriate queries
• not all query types are supported
QUERY SYNTAX 1/2
• title:"The Right Way" AND text:go
    – Phrase query + term query
• Wildcard Searches
    – tes* (test - tests - tester)
    – te?t (test – text)
    – te*t (tempt)
    – tes?
• Fuzzy Searches (Levenshtein Distance) (TERM)
    – roam~ (foam – roams)
    – roam~0.8
• Range Searches
    – mod_date:[20020101 TO 20030101]
    – title:{Aida TO Carmen}
• Proximity Searches (PHRASE)
    – "jakarta apache"~10
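Fuzzy searches are based on Levenshtein (edit) distance, which can be sketched with the classic dynamic program in plain Java (an illustration of the distance itself, not of how Lucene evaluates fuzzy queries; the `Levenshtein` class name is hypothetical):

```java
public class Levenshtein {
    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}
```

This is why roam~ matches foam and roams: each is one edit away from roam.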
QUERY SYNTAX 2/2
• Boosting a Term
    – jakarta^4 apache
    – "jakarta apache"^4 "Apache Lucene"
• Boolean Operators
    – NOT, OR, AND
    – + required operator
    – - prohibit operator
• Grouping with ( )
• Field Grouping
    – title:(+return +"pink panther")
• Escaping special characters with \
QUERY (BY CODE) 1/2
TermQuery query =
  new TermQuery(new Term("name", "Spider-Man"));
• explicit, no escaping necessary
• does not do text analysis for you
• Query subclasses
  –   TermQuery
  –   BooleanQuery
  –   PhraseQuery
  –   FuzzyQuery / WildcardQuery / …
QUERY (BY CODE) 2/2
TermQuery tq1 = new TermQuery(new Term("name", "pippo"));

TermQuery tq2 = new TermQuery(new Term("name", "pluto"));

BooleanQuery bq = new BooleanQuery();

bq.add(tq1, BooleanClause.Occur.MUST);

bq.add(tq2, BooleanClause.Occur.SHOULD);
DELETING DOCUMENTS
IndexWriter
  – deleteDocuments(Term t)
  – deleteDocuments(Term… t)
  – deleteDocuments(Query t)
  – deleteDocuments(Query… t)
  – updateDocument(Term t, Document d)
  – updateDocument(Term t, Document d,
    Analyzer a)
  – deleting does not immediately reclaim space
Relevance feedback
Documents similarity
Apache Tika

ADVANCED TOPICS
Relevance feedback
• Improve retrieval performance using
  information about document relevance
  – Explicit: relevance indicated by the user (assessor)
  – Implicit: from user behavior
  – Blind (or pseudo): using information about the top
    k retrieved documents
Relevance feedback

User query → Searcher → retrieved documents → relevance feedback → new query → Searcher → new retrieved documents
Rocchio Algorithm

Q_new = α·Q + (β/|D_rel|) · Σ_{D_j ∈ D_rel} D_j − (γ/|D_nonrel|) · Σ_{D_k ∈ D_nonrel} D_k

• Q: the original query vector
• D_j: the relevant documents
• D_k: the non-relevant documents
Rocchio Algorithm (blind)
                  1                1             
Qnew    Q           Dj            Dk
                   Drel D j Drel    Dnorel Dk Dnorel



 In blind (pseudo) relevance feedback we know only
 relevant documents (supposed to be the top K
 documents)
                      (generally adopted by search engine)
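The Rocchio update can be sketched over sparse term-weight vectors in plain Java (a sketch, not Lucene code; the `Rocchio` class name is hypothetical, and in the blind variant the non-relevant list is simply left empty):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Rocchio {
    // Q_new = alpha*Q + (beta/|Drel|)*sum(Dj) - (gamma/|Dnonrel|)*sum(Dk)
    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Double>> relevant,
                                             List<Map<String, Double>> nonRelevant,
                                             double alpha, double beta, double gamma) {
        Map<String, Double> q = new HashMap<>();
        add(q, query, alpha);
        for (Map<String, Double> d : relevant) add(q, d, beta / relevant.size());
        for (Map<String, Double> d : nonRelevant) add(q, d, -gamma / nonRelevant.size());
        return q;
    }

    // Accumulate w * v into acc, term by term.
    private static void add(Map<String, Double> acc, Map<String, Double> v, double w) {
        for (Map.Entry<String, Double> e : v.entrySet()) {
            acc.merge(e.getKey(), w * e.getValue(), Double::sum);
        }
    }
}
```

Terms that are frequent in the relevant documents (Cheshire, Cat, King, book in the slide's example) gain weight and enter the expanded query.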
Relevance feedback

Q = Alice retrieves:
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice.
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
• Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?

Qnew = Alice Cheshire Cat King book retrieves:
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice.
• Alice book was reprinted and published in 1866.
• …
Relevance feedback in Lucene
• Possible using TermVector
  – Retrieve the top K documents
  – Build the Qnew using TermVector
  – Re-query using Qnew
Document similarity
• Documents are represented by vectors
  – Build the document vector using TermVector
  – Compute the similarity using cosine similarity
• Implement a functionality like “similar to this
  document”
Apache Tika
• Extract metadata and text from several file
  formats
  – XML, HTML, OpenOffice, Microsoft Office, PDF
  – MBOX format (email)
  – Compressed files
  – EXIF metadata from image
  – Metadata from audio file (MP3)
• An Apache project, like Lucene
Extract metadata and text
Tika tika = new Tika();
String type = tika.detect(new File(args[0]));
System.out.println("File type: " + type);
Metadata metadata = new Metadata();
String text = tika.parseToString(new FileInputStream(new File(args[0])),
     metadata);
System.out.println("Metadata");
String[] names = metadata.names();
for (int i = 0; i < names.length; i++) {
  String value = metadata.get(names[i]);
  if (value != null) {
      System.out.println(names[i] + "=" + value);
  }
}
System.out.println("Text: " + text);
Apache Tika + Lucene

Files/URLs → Tika → metadata + text → Lucene
CONCLUSIONS
Conclusions
• A popular IR model
  – Vector Space Model
• Lucene
  – provides API to build search engine
• Apache Tika
  – extracts metadata and text from files and URLs
• LET'S GO BUILD YOUR SEARCH ENGINE!
Lucene Contrib
• analyzers: additional analyzers
  – Arabic, Chinese, …
• highlighter: a set of classes for highlighting
  matching terms in search results
• Snowball: stemmer analyzer based on
  Snowball library http://snowball.tartarus.org
• spellchecker: tools for spellchecking and
  suggestions with Lucene
Lucene-related projects
• Apache Solr: open source enterprise search
  platform from the Apache Lucene project
  – full-text search capabilities, high-volume web traffic,
    HTML administration interfaces




• Apache Nutch: web search engine based on Solr
  and Lucene
  – crawler, link-graph database, HTML
Thank you for your attention!
         Questions?

2024.03.23 What do successful readers do - Sandy Millin for PARK.pptxSandy Millin
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxiammrhaywood
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfMohonDas
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.raviapr7
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptxraviapr7
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesMohammad Hassany
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxheathfieldcps1
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxDr. Santhosh Kumar. N
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17Celine George
 

Recently uploaded (20)

Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdf
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming Classes
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 

[F5 Hit Refresh] Pierpaolo Basile - Information Access with Apache Lucene

  • 1. Information Access with Apache Lucene Pierpaolo Basile, PhD Research Assistant University of Bari Aldo Moro co-founder QuestionCube s.r.l. pierpaolo.basile@{uniba.it, questioncube.com} CODE CAMP: 20 JULY - 1 AUGUST 2012, S. VITO DEI NORMANNI (BR)
  • 2. Outline • Search Engine – Vector Space Model – Inverted Index • Apache Lucene – Indexing – Searching • Advanced topics – Relevance feedback – Document similarity – Apache Tika
  • 4. Searching (diagram): a user query is run against a collection of documents and returns the relevant documents.
  • 5. Information Retrieval Information Retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. Wikipedia
  • 6. Information Retrieval Process (diagram): the user query passes through text operations into a logic view, then query operations; searching runs the query against the inverted index built by indexing the documents, and ranking orders the retrieved documents, with user feedback closing the loop.
  • 7. Information Retrieval Model <D, Q, F, R(qi, dj)> • D: document representation • Q: query representation • F: query/document representation framework • R(qi, dj): ranking function
  • 8. Bag-of-words representation Document/query as unordered collection of words John likes to watch movies. Mary likes too. John also likes to watch football games. {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
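The bag-of-words idea above can be sketched in a few lines of plain Java (class and method names are illustrative, not Lucene API): order is discarded and each word keeps only its occurrence count, the multiset view of the dictionary shown on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy bag-of-words: map each word to its occurrence count, ignoring word order.
public class BagOfWords {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);  // increment the count for this word
            }
        }
        return bag;
    }
}
```

Two documents with the same words in a different order produce the same bag, which is exactly the information the model keeps.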
  • 9. Vector Space Model Query/Document Framework • Algebraic model • Documents/queries are represented as vector in a geometric space • Terms are vector components
  • 10. Vector Space Model: d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}) and q = (w_{1,q}, w_{2,q}, ..., w_{t,q}). Each vector component corresponds to a term; the component value w_{i,j} is the weight of the term t_i in the document d_j.
  • 11. Vector Space Model (plot): d_j = (hit:5, refresh:2), q1 = (refresh:1), q2 = (hit:1) drawn as vectors on the refresh and hit axes.
  • 12. Vector Space Model Ranking function • Vector similarity between query and document computed by the cosine: R(d_j, q) = cos θ = (d_j · q) / (|d_j| |q|)
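The cosine ranking function can be sketched in plain Java (an illustrative sketch, not Lucene's scoring implementation):

```java
// Cosine similarity between two dense term-weight vectors:
// R(d_j, q) = (d_j . q) / (|d_j| * |q|)
public class Cosine {
    public static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];      // numerator: dot product
            normA += a[i] * a[i];    // squared length of a
            normB += b[i] * b[i];    // squared length of b
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With the slide's vectors d_j = (hit:5, refresh:2) and q2 = (hit:1), the similarity is 5/sqrt(29), i.e. the small angle θ2 gives a high score.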
  • 13. Vector Space Model (plot): the same vectors d_j = (hit:5, refresh:2), q1 = (refresh:1), q2 = (hit:1), now showing the angles θ1 and θ2 between d_j and each query.
  • 14. Term-Document matrix • ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’ • ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice. • Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate. • Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark. • In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her. • Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
  • 15. Term-Document matrix (term counts per document): Cheshire Cat D1:1 D2:0 D3:1; Alice D1:2 D2:2 D3:2; book D1:1 D2:0 D3:0; King D1:1 D2:1 D3:0; table D1:0 D2:1 D3:1; Queen D1:0 D2:1 D3:1; grin D1:0 D2:0 D3:2
  • 16. Term-Document matrix: the same matrix, with the query "Alice AND Queen" represented as the vector (Cheshire Cat:0, Alice:1, book:0, King:0, table:0, Queen:1, grin:0)
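Building such a term-document count matrix can be sketched in plain Java (toy documents and names of my own, not the slide's exact text):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Term-document matrix: for each vocabulary term, one row of
// per-document occurrence counts.
public class TermDocumentMatrix {
    public static Map<String, int[]> build(String[] docs, String[] vocabulary) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        for (String term : vocabulary) {
            int[] row = new int[docs.length];
            for (int d = 0; d < docs.length; d++) {
                for (String token : docs[d].toLowerCase().split("\\W+")) {
                    if (token.equals(term)) row[d]++;   // count occurrences of term in doc d
                }
            }
            matrix.put(term, row);
        }
        return matrix;
    }
}
```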
  • 18. Inverted Index Dictionary Alice D1:2 D2:2 D3:2 … book D1:1 … Cheshire Cat D1:1 D3:1 … grin D3:2 … King D1:1 D2:1 … Queen D2:1 D3:1 … table D2:1 D3:1 …
  • 19. Inverted Index Dictionary Posting List Alice D1:2 D2:2 D3:2 … book D1:1 … Cheshire Cat D1:1 D3:1 … grin D3:2 … King D1:1 D2:1 … Queen D2:1 D3:1 … table D2:1 D3:1 …
  • 27. Inverted Index: Dictionary, Posting List and Position List. Alice D1:2 (positions 14, 49) D2:2 (positions 1, 40) D3:2 (positions 18, 34) … book D1:1 (position 33) … The position list records where each term occurs inside the document.
  • 28. Inverted Index: query processing. AND (intersection): <Alice> ∩ <King> intersects the posting lists D1:2 D2:2 D3:2 and D1:1 D2:1, so Alice AND King -> (D1, D2)
  • 29. Inverted Index: query processing. OR (union): <grin> ∪ <King> merges the posting lists D3:2 and D1:1 D2:1, so grin OR King -> (D1, D2, D3)
  • 30. Inverted Index: query processing. NOT (complement): <Alice> minus <grin> removes D3:2 from D1:2 D2:2 D3:2, so Alice NOT grin -> (D1, D2)
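The three posting-list operations above can be sketched as a toy inverted index in plain Java (a didactic sketch, not Lucene's on-disk data structures):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: dictionary mapping term -> sorted posting list of doc ids,
// plus the three boolean operations from the slides.
public class InvertedIndex {
    private final Map<String, TreeSet<Integer>> dictionary = new HashMap<>();

    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                dictionary.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    private TreeSet<Integer> postings(String term) {
        return dictionary.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public Set<Integer> and(String t1, String t2) {   // intersection of posting lists
        TreeSet<Integer> result = new TreeSet<>(postings(t1));
        result.retainAll(postings(t2));
        return result;
    }

    public Set<Integer> or(String t1, String t2) {    // union of posting lists
        TreeSet<Integer> result = new TreeSet<>(postings(t1));
        result.addAll(postings(t2));
        return result;
    }

    public Set<Integer> not(String t1, String t2) {   // complement: t1's postings minus t2's
        TreeSet<Integer> result = new TreeSet<>(postings(t1));
        result.removeAll(postings(t2));
        return result;
    }
}
```

Indexing three toy documents that mirror the slide's dictionary reproduces the three example results: Alice AND King -> (D1, D2), grin OR King -> (D1, D2, D3), Alice NOT grin -> (D1, D2).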
  • 31. Term-weight • Measures the term relevance in a document – component value in the document representation – TF*IDF • TF (term frequency): term occurrences in the document • IDF (inverse document frequency): inverse to the number of documents in which the term occurs: tf*idf(t, d) = tf(t, d) * log(|D| / |{d' ∈ D : t ∈ d'}|)
  • 32. Term-weight (continued): in tf*idf(t, d) = tf(t, d) * log(|D| / |{d' ∈ D : t ∈ d'}|), |D| is the number of documents in the collection and the log factor is the idf.
  • 33. TF*IDF insights • increases proportionally to the term frequency in the document • decreases with the number of documents that contain the term – common words are generally more frequent in the collection • IDF depends on the collection, TF on the document
  • 34. Inverted index/TF*IDF • TF: computed from the term occurrences in the posting list • IDF: computed from the posting list cardinality. Example: Alice D1:2 D2:2 D3:2 with |D|=100 gives tf*idf(Alice, D1) = 2 * log(100/3); the TF factor is the 2.
  • 35. Inverted index/TF*IDF (continued): in the same example tf*idf(Alice, D1) = 2 * log(100/3), the IDF factor is log(100/3), obtained from |D|=100 and the posting list cardinality 3.
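The tf*idf formula is a one-liner; a sketch checking the slide's worked example (hypothetical helper method, not a Lucene API):

```java
// tf*idf(t, d) = tf(t, d) * log(|D| / df(t)), as defined on the slides:
// tf is the term count in the document, collectionSize is |D|,
// docFreq is the number of documents containing the term.
public class TfIdf {
    public static double tfIdf(int tf, int collectionSize, int docFreq) {
        return tf * Math.log((double) collectionSize / docFreq);
    }
}
```

For Alice in D1 (tf=2, |D|=100, docFreq=3) this yields 2 * log(100/3); a term that occurs in every document gets weight 0, since log(1) = 0.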
  • 37. Apache Lucene • Apache Software Foundation project – http://lucene.apache.org • What Lucene is – Search library: indexing and searching Application Programming Interface (API) • What Lucene is not – Search engine (no crawling, server, user interface, etc.)
  • 38. Lucene • Full-text search • High performance • Scalable • Cross-platform • 100%-pure Java • Porting in other languages: .NET, Python
  • 39. Information Retrieval Process (diagram): the same pipeline as before, shown again as a reminder: user query, text operations, logic view, query operations, searching over the inverted index, ranking of the retrieved documents, user feedback.
  • 40. Information Retrieval Process mapped to org.apache.lucene classes (diagram): text operations are handled by Analyzer/Tokenizer/Filter, indexing by IndexWriter, query parsing by QueryParser, index access by IndexReader, searching by IndexSearcher, and ranking by Similarity.
  • 41. Lucene Model 1/2 • Based on Vector Space Model • Features: – multi-field document – tf–term frequency: number of matching terms in field – lengthNorm: number of tokens in field – idf: inverse document frequency – coord: coordination factor, number of matching • Terms – field boost – query clause boost
  • 42. Lucene Model 2/2 • coord(q,d) score factor based on how many of the query terms are found in the specified document • queryNorm(q) is a normalizing factor used to make scores between queries comparable • boost(t) term boost • norm(t,d) encapsulates a few (indexing time) boost and length factors
  • 43. Lucene TopDocs
  • 44. Document Indexing
    FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(dir, config);
    Document doc = new Document();
    doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();
  • 45. Field Options • Field.Store – NO: field value is not stored – YES: field value is stored • Field.Index – ANALYZED: field is analyzed by the analyzer – NO: not indexed – NOT_ANALYZED: indexed but not analyzed
  • 46. Luke • Lucene diagnostic tool: http://code.google.com/p/luke/
  • 48. Document: Fields Representation (example: http://www.bbc.co.uk/news/technology-15365207). URL: Field.Store.YES, Field.Index.NOT_ANALYZED; Date: Field.Store.YES, Field.Index.NO; Authors: Field.Store.YES, Field.Index.ANALYZED; Title: Field.Store.NO, Field.Index.ANALYZED; Abstract: Field.Store.NO, Field.Index.ANALYZED; Content: Field.Store.NO, Field.Index.ANALYZED; Comments: Field.Store.NO, Field.Index.ANALYZED
  • 49. Text Operations • Tokenization – split text into tokens • Stop word elimination – remove common words and closed word classes (e.g. function words) • Stemming – reducing inflected (or sometimes derived) words to their stem
  • 50. Analyzers • KeywordAnalyzer – Returns a single token • WhitespaceAnalyzer – Splits tokens at whitespace • SimpleAnalyzer – Divides text at nonletter characters and lowercases • StopAnalyzer – Divides text at nonletter characters, lowercases, and removes stop words • StandardAnalyzer – Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, etc.; lowercases and removes stop words (optional)
  • 51. Analyzers The quick brown fox jumped over the lazy dogs • WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] • SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] • StopAnalyzer: [quick] [brown] [fox] [jumped] [lazy] [dogs] • StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
  • 52. Analyzers "XY&Z Corporation - xyz@example.com" • WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com] • SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] • StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
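The analyzer outputs above can be approximated in plain Java; a sketch of what a StopAnalyzer-style pipeline does (split at nonletter characters, lowercase, drop stop words; the stop-word list here is a small illustrative subset, not Lucene's):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java approximation of a StopAnalyzer-like pipeline.
public class SimpleStopAnalyzer {
    // Illustrative stop-word subset; Lucene ships its own ENGLISH_STOP_WORDS_SET.
    private static final Set<String> STOP =
            new HashSet<>(Arrays.asList("the", "over", "a", "an", "and"));

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^a-zA-Z]+")) {   // divide at nonletter characters
            String lower = t.toLowerCase();           // lowercase filter
            if (!lower.isEmpty() && !STOP.contains(lower)) {
                tokens.add(lower);                    // stop filter
            }
        }
        return tokens;
    }
}
```

On the slide's sentence this reproduces the StopAnalyzer output: [quick] [brown] [fox] [jumped] [lazy] [dogs].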
  • 53. Tokenizers • A Tokenizer is a TokenStream whose input is a Reader • Tokenizers break field text into tokens • StandardTokenizer – source string: “full-text lucene.apache.org” – “full” “text” “lucene.apache.org” • WhitespaceTokenizer – “full-text” “lucene.apache.org” • LetterTokenizer – “full” “text” “lucene” “apache” “org”
  • 54. TokenFilters • A TokenFilter is a TokenStream whose input is another token stream • LowerCaseFilter • StopFilter • LengthFilter • PorterStemFilter – stemming: reducing words to root form – rides, ride, riding => ride – country, countries => countri
  • 55. Analyzer
    public class MyAnalyzer2 extends Analyzer {
      private Set sw = StopFilter.makeStopSet(Version.LUCENE_36,
          new ArrayList(StopAnalyzer.ENGLISH_STOP_WORDS_SET));
      @Override
      public TokenStream tokenStream(String string, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, sw);
        return ts;
      }
    }
  • 56. TermVectors • Store information about term frequency, positions and offsets • Relevance Feedback and “More Like This” • Clustering • Similarity between two documents • Highlighter – needs offsets info
  • 57. Lucene TermVector • TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance • As a tuple: – termFreq = <term, term count> – <fieldName, <…, termFreq_i, termFreq_{i+1}, …>> • As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies()
  • 58. Lucene TermPositionVector • TermPositionVector is a representation of all of the terms, term counts, term position and term offsets in a specific Field of a Document instance • As Java: – public String getField(); – public String[] getTerms(); – public int[] getTermFrequencies() – public int[] getTermPositions(int index) – public TermVectorOffsetInfo[] getOffsets(int index)
  • 59. Store TermVector • During indexing, create a Field that stores Term Vectors: – new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES); • Options are: – Field.TermVector.YES – Field.TermVector.NO – Field.TermVector.WITH_POSITIONS – Field.TermVector.WITH_OFFSETS – Field.TermVector.WITH_POSITIONS_OFFSETS
  • 60. Term Vector: Positions and Offsets. For "The quick brown fox jumped over the lazy dogs", positions are the token indices 0 to 8, while offsets are character ranges: The 0-3, quick 4-9, brown 10-15, fox 16-19, jumped 20-26, over 27-31, the 32-35, lazy 36-40, dogs 41-45.
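Token positions and character offsets can be computed with a simple whitespace scan in plain Java (an illustrative sketch; in Lucene this information is recorded by a TermPositionVector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// For each token: its position (token index) and its character
// offsets (start inclusive, end exclusive), as on the slide.
public class PositionsAndOffsets {
    public static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();   // each entry: {position, start, end}
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            result.add(new int[]{position++, m.start(), m.end()});
        }
        return result;
    }
}
```

On the slide's sentence, "quick" is at position 1 with offsets 4-9 and "dogs" at position 8 with offsets 41-45.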
  • 61. Example • Main method with three parameters – Input_dir: directory to index – Index_dir: directory containing the index files – Flag (string) corresponding to different Field.TermVector... options • t: tokenize • tv: term vectors • tvp: term vectors with positions • tvo: term vectors with positions and offsets • The constructor of the Index class takes in input the three parameters and indexes the directory – Filters only “.txt” files
  • 62. Lucene Search • IndexReader: interface for accessing an index • QueryParser: parses the user query • IndexSearcher: implements search over a single IndexReader – search(Query query, int numDoc) – TopDocs -> result of search • TopDocs.scoreDocs returns an array of retrieved documents
  • 63. Lucene Search
    Directory dir = FSDirectory.open(new File(args[0]));
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
    Query query = parser.parse(args[1]);
    TopDocs topDocs = searcher.search(query, 100);
    System.out.println("matches: " + topDocs.totalHits);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      Document doc = reader.document(topDocs.scoreDocs[i].doc);
      System.out.println("name: " + doc.get("name") + ", score: " + topDocs.scoreDocs[i].score);
    }
    searcher.close();
  • 64. Lucene QueryParser • Example: – queryParser.parse(“name:SpiderMan”); • good human entered queries, debugging • does text analysis and constructs appropriate queries • not all query types supported
  • 65. QUERY SYNTAX 1/2 • title:"The Right Way" AND text:go – Phrase query + term query • Wildcard Searches – tes* (test - tests - tester) – te?t (test – text) – te*t (tempt) – tes? • Fuzzy Searches (Levenshtein Distance) (TERM) – roam~ (foam – roams) – roam~0.8 • Range Searches – mod_date:[20020101 TO 20030101] – title:{Aida TO Carmen} • Proximity Searches (PHRASE) – "jakarta apache"~10
  • 66. QUERY SYNTAX 2/2 • Boosting a Term – jakarta^4 apache – "jakarta apache"^4 "Apache Lucene" • Boolean Operator – NOT, OR, AND – + required operator – - prohibit operator • Grouping by ( ) • Field Grouping • title:(+return +"pink panther") • Escaping Special Characters by \
  • 67. QUERY (BY CODE) 1/2 TermQuery query = new TermQuery(new Term("name", "Spider-Man")) • explicit, no escaping necessary • does not do text analysis for you • Query – TermQuery – BooleanQuery – PhraseQuery – FuzzyQuery / WildcardQuery
  • 68. QUERY (BY CODE) 2/2
    TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
    TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
    BooleanQuery bq = new BooleanQuery();
    bq.add(tq1, BooleanClause.Occur.MUST);
    bq.add(tq2, BooleanClause.Occur.SHOULD);
  • 69. DELETING DOCUMENTS IndexWriter – deleteDocuments(Term t) – deleteDocuments(Term… t) – deleteDocuments(Query t) – deleteDocuments(Query… t) – updateDocument(Term t, Document d) – updateDocument(Term t, Document d, Analyzer a) – deleting does not immediately reclaim space
  • 71. Relevance feedback • Improve retrieval performance using information about document relevance – Explicit: relevance indicated by the user (assessor) – Implicit: from user behavior – Blind (or pseudo): using information about the top k retrieved documents
  • 72. Relevance feedback (diagram): the user query goes to the searcher; the retrieved documents feed the relevance feedback step, which builds a new query that is searched again to obtain new retrieved documents.
  • 73. Rocchio Algorithm: Q_new = α·Q + (β/|Drel|)·Σ_{Dj ∈ Drel} Dj − (γ/|Dnonrel|)·Σ_{Dk ∈ Dnonrel} Dk, combining the original query Q with the vectors of the relevant documents Dj and of the non-relevant documents Dk.
  • 74. Rocchio Algorithm (blind): the same update Q_new = α·Q + (β/|Drel|)·Σ_{Dj ∈ Drel} Dj − (γ/|Dnonrel|)·Σ_{Dk ∈ Dnonrel} Dk; in blind (pseudo) relevance feedback only the relevant documents are known, assumed to be the top K retrieved documents (the variant generally adopted by search engines).
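The Rocchio update can be sketched on dense vectors in plain Java (parameter names alpha, beta, gamma follow the formula; a didactic sketch, not a Lucene API):

```java
// Rocchio update:
// Q_new = alpha*Q + (beta/|Drel|)*sum(Dj) - (gamma/|Dnonrel|)*sum(Dk)
public class Rocchio {
    public static double[] update(double alpha, double beta, double gamma,
                                  double[] query,
                                  double[][] relevant, double[][] nonRelevant) {
        double[] q = new double[query.length];
        for (int i = 0; i < q.length; i++) {
            q[i] = alpha * query[i];                       // keep the original query
            for (double[] d : relevant) {
                q[i] += beta * d[i] / relevant.length;     // pull toward relevant docs
            }
            for (double[] d : nonRelevant) {
                q[i] -= gamma * d[i] / nonRelevant.length; // push away from non-relevant docs
            }
        }
        return q;
    }
}
```

For blind feedback, pass an empty nonRelevant array and use the top K retrieved documents as the relevant set.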
  • 75. Relevance feedback example: the query Q = Alice retrieves the Cheshire Cat and King passages; the expanded query Qnew = Alice Cheshire Cat King book retrieves the same passages plus new ones, such as "Alice book was reprinted and published in 1866."
  • 76. Relevance feedback in Lucene • Possible using TermVector – Retrieve the top K documents – Build the Qnew using TermVector – Re-query using Qnew
  • 77. Document similarity • Documents are represented by vectors – Build the document vector using TermVector – Compute the similarity using cosine similarity • Implement a functionality like “similar to this document”
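A "similar to this document" check reduces to cosine similarity between two sparse term-frequency vectors, such as those one could rebuild from Lucene TermVectors (sketch with illustrative names):

```java
import java.util.Map;

// Cosine similarity between two sparse term-frequency vectors,
// represented as term -> frequency maps.
public class DocSimilarity {
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);  // shared terms only
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int f : v.values()) sum += (double) f * f;
        return Math.sqrt(sum);
    }
}
```

Identical documents score 1.0; documents sharing no terms score 0.0, which is why stop-word removal matters before comparing.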
  • 78. Apache Tika • Extract metadata and text from several file formats – XML, HTML, OpenOffice, Microsoft Office, PDF – MBOX format (email) – Compressed files – EXIF metadata from images – Metadata from audio files (MP3) • An Apache project, like Lucene
  • 79. Extract metadata and text
    Tika tika = new Tika();
    String type = tika.detect(new File(args[0]));
    System.out.println("File type: " + type);
    Metadata metadata = new Metadata();
    String text = tika.parseToString(new FileInputStream(new File(args[0])), metadata);
    System.out.println("Metadata");
    String[] names = metadata.names();
    for (int i = 0; i < names.length; i++) {
      String value = metadata.get(names[i]);
      if (value != null) {
        System.out.println(names[i] + "=" + value);
      }
    }
    System.out.println("Text: " + text);
  • 80. Apache Tika + Lucene (diagram): files/URLs are fed to Tika, which extracts metadata and text that Lucene then indexes.
  • 82. Conclusions • A popular IR model – Vector Space Model • Lucene – provides an API to build search engines • Apache Tika – extracts metadata and text from files and URLs • LET'S GO BUILD YOUR SEARCH ENGINE!
  • 83. Lucene Contrib • analyzers: additional analyzers – Arabic, Chinese, … • highlighter: a set of classes for highlighting matching terms in search results • Snowball: stemmer analyzer based on the Snowball library http://snowball.tartarus.org • spellchecker: tools for spellchecking and suggestions with Lucene
  • 84. Lucene related project • Apache Solr: open source enterprise search platform from the Apache Lucene – Full-Text Search Capabilities, High Volume Web Traffic, HTML Administration Interfaces • Apache Nutch: web-search engine based on Solr and Lucene – crawler, link-graph database, HTML
  • 85. Thank you for your attention! Questions?