



  Seminar on

                    Information Retrieval (IR)



By : Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 3 Nov. 2009

           Hadi Mohammadzadeh        Information Retrieval (IR)   50 Pages       1
.




       Information Retrieval Definition

• Information Retrieval (IR) is:
     finding material (usually documents)
     of an unstructured nature (usually text)
     that satisfies an information need (query)
     from within large collections (usually stored on computers).


.




Basic assumptions of Information Retrieval

• Collection: Fixed set of documents
• Goal: Retrieve documents with information
  that is relevant to the user’s information need
  and helps the user complete a task




.




          Search Methods
                                for

      Finding Documents




.




             Searching Methods

• Grep method
• Term-document incidence matrix (binary retrieval)
• Inverted index
• Inverted index with skip pointers/skip lists
• Positional postings (for phrase queries)




.




     Term-document incidence

            Antony and Cleopatra    Julius Caesar     The Tempest     Hamlet       Othello   Macbeth

 Antony              1                     1                0              0         0         1
 Brutus              1                     1                0              1         0         0
 Caesar              1                     1                0              1         1         1
Calpurnia            0                     1                0              0         0         0
Cleopatra            1                     0                0              0         0         0
 mercy               1                     0                1              1         1         1
 worser              1                     0                1              1         1         0




                                                                  1 if play contains
                                                                  word, 0 otherwise
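
A minimal sketch of how the incidence matrix above can answer a Boolean query. The rows are copied from the table; the query (Brutus AND Caesar AND NOT Calpurnia) and the function name are chosen here for illustration only.

```python
# Incidence rows copied from the table above; one bit per play, in column order.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def brutus_and_caesar_not_calpurnia():
    """Combine three term rows bit by bit and return the matching plays."""
    rows = zip(INCIDENCE["Brutus"], INCIDENCE["Caesar"], INCIDENCE["Calpurnia"])
    return [play for play, (b, c, cal) in zip(PLAYS, rows) if b and c and not cal]


print(brutus_and_caesar_not_calpurnia())
```

With the table as given, the query matches "Antony and Cleopatra" and "Hamlet".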
.    Sec. 1.2




                    Inverted index
• For each term T, we must store a list of all
  documents that contain T.
• Do we use an array or a list for this?

Brutus                        2       4         8          16 32 64 128
Calpurnia                     1        2        3          5       8      13 21 34
Caesar                    13 16

            What happens if the word Caesar
            is added to document 14?
.         Sec. 1.2




                    Inverted index
• Linked lists generally preferred to arrays
   – Dynamic space allocation
   – Insertion of terms into documents easy
   – Space overhead of pointers

  Dictionary        Postings lists (each doc ID in a list is a posting)
  Brutus            2   4   8   16   32   64   128
  Calpurnia         1   2   3   5    8    13   21   34
  Caesar            13  16
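
A minimal sketch of why postings are kept sorted by doc ID: two lists can be intersected in one linear merge pass. Python lists stand in for the linked lists here (an assumption made for brevity), and `bisect.insort` illustrates adding Caesar to document 14 while keeping the list sorted.

```python
import bisect


def intersect(p1, p2):
    """Linear-time merge of two postings lists sorted by doc ID."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [1, 2, 3, 5, 8, 13, 21, 34]
caesar = [13, 16]

print(intersect(brutus, calpurnia))

# Adding Caesar to document 14 means inserting 14 between 13 and 16.
bisect.insort(caesar, 14)
print(caesar)
```

In an array the insertion shifts every later posting; in a linked list it only rewires pointers, which is the trade-off the slide points at.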
.



  Augment postings with skip pointers
          (at indexing time)
  List 1:  2  4  8  41  48  64  128      (skip pointers: 2 → 41 → 128)
  List 2:  1  2  3  8  11  17  21  31    (skip pointers: 1 → 11 → 31)

• Why?
• To skip postings that will not figure in the search
  results.
• Where do we place skip pointers?

.




     Where do we place skips?
• Tradeoff:
  – More skips → shorter skip spans ⇒ more
    likely to skip. But lots of comparisons to skip
    pointers.
  – Fewer skips → fewer pointer comparisons, but
    then long skip spans ⇒ few successful skips.
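
A minimal sketch of intersection with skip pointers. It assumes evenly spaced skips of about √P entries for a list of length P, which is one common heuristic and not something the slides prescribe; the skip pointers are simulated with array indexes, and the helper names are hypothetical.

```python
import math


def _advance(postings, i, target, span):
    """Step forward from index i; take the skip only if it does not overshoot target."""
    has_skip = i % span == 0 and i + span < len(postings)
    if has_skip and postings[i + span] <= target:
        return i + span  # successful skip
    return i + 1         # ordinary step


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using simulated skip pointers."""
    span1 = max(1, math.isqrt(len(p1)))  # assumed skip spacing: about sqrt(P)
    span2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = _advance(p1, i, p2[j], span1)
        else:
            j = _advance(p2, j, p1[i], span2)
    return answer


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))
```

A skip is taken only when its target is still less than or equal to the other list's current doc ID, so no match can be jumped over.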




.




      Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
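
A small sketch of how the question can be narrowed down from the postings of "be" alone: in "to be or not to be" the two occurrences of "be" are 4 positions apart, so a candidate document needs two "be" positions that differ by 4. Only the positions listed on the slide are used (the elided ones for doc 5 are left out), and the function name is hypothetical.

```python
# Positional postings for "be", copied from the slide (listed positions only).
BE_POSTINGS = {
    1: [7, 18, 33, 72, 86, 231],
    2: [3, 149],
    4: [17, 191, 291, 430, 434],
    5: [363, 367],
}


def docs_with_gap(postings, gap):
    """Return doc IDs in which two positions of the term are exactly `gap` apart."""
    result = []
    for doc_id in sorted(postings):
        positions = set(postings[doc_id])
        if any(p + gap in positions for p in positions):
            result.append(doc_id)
    return result


print(docs_with_gap(BE_POSTINGS, 4))
```

Docs 4 and 5 remain as candidates; a full check would also intersect with the positional postings of "to", "or" and "not".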




.     Sec. 1.2



 Steps of Inverted index construction
Documents to be indexed:   Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:           friend  roman  countryman
        ↓ Indexer
Inverted index:            friend      → 2, 4
                           roman       → 1, 2
                           countryman  → 13, 16
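
The pipeline above can be sketched in a few lines. This is a toy illustration: the `LEMMAS` table is a hypothetical stand-in for the linguistic modules, and the second document is made up so that the postings have more than one entry.

```python
import re
from collections import defaultdict

# Hypothetical toy lookup standing in for stemming/lemmatization.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Case-fold, then map to a base form if the toy table knows one."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_index(docs):
    """Map each normalized term to a sorted list of doc IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in sorted(terms):
            index[term].append(doc_id)
    return dict(index)


docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
print(build_index(docs))
```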
.




    Parts of an Inverted Index
• Dictionary
  – Commonly kept in memory
• Postings lists
  – Commonly kept on disk




.



                  Inverted index construction
Preprocessing to form the term vocabulary
 Tokenization (problems)
      Hyphens
      apostrophes
      Compounds
      Chinese
      numbers
 Dropping stop words
    But you need them for: phrase queries, various song titles,
     relational queries
 Normalization (term equivalence classing)
      Numbers
      Case folding (reduce all letters to lower case)
      Stemming (Porter’s algorithm): reduce terms to their “roots”
      Lemmatization (reduce variant forms to base form)
.


                          Inverted index construction

                  Index Construction
Blocked sort-based indexing (BSBI)
     Algorithm
          Accumulate postings for each block, sort, write to disk
          Then merge (external sorting) the blocks into one long sorted order
Distributed indexing using MapReduce
     Break up indexing into two sets of parallel tasks
          Parsers
          Inverters
     Break the input document corpus into splits
     Parsers
          Master assigns a split to an idle parser machine
          Parser reads a document at a time and emits (term,doc) pairs
          Parser writes pairs into j partitions
          Each partition is for a range of terms' first letters
     Inverters
          An inverter collects all (term,doc) pairs for one term partition
          Sorts and writes to postings lists

Dynamic Indexing
.



                         Inverted index construction
                   Index Construction: data flow
[Figure: the master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); the master assigns each partition to an inverter (reduce phase), which writes the postings.]
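
A single-process sketch of this data flow, assuming lowercase ASCII terms and whitespace tokenization; it imitates the parser and inverter roles in plain Python and is not tied to any MapReduce framework. The sample splits are made up.

```python
from collections import defaultdict

# Term partitions by first letter, as in the data-flow figure.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Return the index of the partition whose letter range covers the term."""
    for k, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return k
    raise ValueError(f"term outside a-z: {term!r}")


def parse(split):
    """Map phase: emit (term, doc) pairs into one segment per partition."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: sort the pairs of one partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


# Hypothetical splits of a tiny corpus: each split is a list of (doc_id, text).
splits = [[(1, "caesar came")], [(2, "caesar conquered"), (3, "rome fell")]]
segment_files = [parse(split) for split in splits]

index = {}
for k in range(len(PARTITIONS)):
    pairs = [pair for segments in segment_files for pair in segments[k]]
    index.update(invert(pairs))
print(index)
```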
.




Search structures for Dictionary
 A naïve dictionary
 Hash tables
 Trees
      Binary tree
      B-Tree




.




          Index compression
Dictionary compression for Boolean indexes
     Array of fixed-width entries (wasteful)
     Dictionary as a string
     Blocking
     Front coding

Postings compression
     Gap encoding using prefix-unique codes
     Variable-byte codes
     Gamma codes (seldom used in practice)
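
A minimal sketch of the postings compression ideas listed above: doc IDs are turned into gaps, and each gap is stored with a variable-byte code (7 payload bits per byte, high bit set on the last byte) or a gamma code (unary length followed by the offset). The sample doc IDs are illustrative.

```python
def to_gaps(doc_ids):
    """Replace each doc ID after the first by its difference from the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def vb_encode_number(n):
    """Variable-byte code: 7 bits per byte, continuation bit set on the final byte."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128
    return bytes(out)


def vb_decode(data):
    """Decode a concatenation of variable-byte codes back into numbers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers


def gamma_encode(n):
    """Gamma code for n >= 1: unary code of the offset length, then the offset bits."""
    offset = bin(n)[3:]  # binary representation without the leading 1
    return "1" * len(offset) + "0" + offset


gaps = to_gaps([824, 829, 215406])
encoded = b"".join(vb_encode_number(g) for g in gaps)
print(gaps, list(encoded), gamma_encode(13))
```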




.


             Dictionary compression for Boolean indexes
                   Dictionary-as-a-String

                     ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Table columns: Freq. | Postings ptr. | Term ptr.   (example Freq. values: 33, 29, 44, 126)

• Total string length = 400K × 8B = 3.2MB
• Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes




.


              Dictionary compression for Boolean indexes
                                       Blocking

            ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….



Table columns: Freq. | Postings ptr. | Term ptr.   (example Freq. values: 33, 29, 44, 126, 7)

• Save 9 bytes on 3 pointers.
• Lose 4 bytes on term lengths.




.


    Dictionary compression for Boolean indexes
                     Front coding
 – Sorted words commonly have a long common prefix: store differences only
 – (for last k-1 in a block of k)
 8automata8automate9automatic10automation
    → 8automat*a1◊e2◊ic3◊ion
 – "8automat*" encodes automat; the numbers 1, 2, 3 give the extra length beyond automat.
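
A minimal sketch that reproduces the front-coded string above for one block of sorted terms. It assumes the whole block shares one prefix, marked with `*`, and that `◊` separates each extra length from its suffix; the function name is hypothetical.

```python
import os


def front_code(block):
    """Front-code one block of sorted terms that share a common prefix."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(block)
    first, rest = block[0], block[1:]
    # First term: full length, the shared prefix, '*', then the remainder of the term.
    encoded = f"{len(first)}{prefix}*{first[len(prefix):]}"
    # Remaining terms: extra length beyond the prefix, '◊', then the suffix itself.
    for term in rest:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}◊{suffix}"
    return encoded


print(front_code(["automata", "automate", "automatic", "automation"]))
```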

.




     Information Retrieval


Ranked Retrieval


.


                       Information Retrieval
                    Ranked retrieval
• Thus far, our queries have all been Boolean.
  – Good for expert users.
  – Also good for applications: applications can
    easily consume 1000s of results.
  – Not good for the majority of users.
  – Most users are incapable of writing Boolean queries (or
    they are capable, but think it’s too much work).
• Most users don’t want to wade through 1000s of
  results.
  – This is particularly true of web search.

.




                    Term Weighting
• Term frequency and inverse document frequency
   – TF weight:
        \[ w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases} \]
   – IDF: with df_t the number of docs in the collection that contain a term t, and N the total number of docs,
        \[ \mathrm{idf}_t = \log_{10} (N / \mathrm{df}_t) \]
• tf-idf weighting
   – The tf-idf weight of a term is the product of its tf weight and its idf weight
        \[ w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} (N / \mathrm{df}_t) \]
• tf-idf is the best known weighting scheme in
  information retrieval
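
A short worked example of the weighting formulas above. The numbers (a term occurring 10 times in a document, in a collection of 1,000,000 docs of which 1,000 contain the term) are made up for illustration.

```python
import math


def tf_weight(tf):
    """Log-scaled term frequency: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def tf_idf(tf, n_docs, df):
    """tf-idf weight: product of the tf weight and the idf weight."""
    return tf_weight(tf) * idf(n_docs, df)


# tf weight = 1 + log10(10) = 2, idf = log10(1000) = 3, so the weight is about 6.
print(tf_idf(10, 1_000_000, 1_000))
```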
.




Vector space model for scoring
– Represent the query as a weighted tf-idf vector
– Represent each document as a weighted tf-idf vector
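– Compute the cosine similarity score for the query vector
  and each document vector

  \[ \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{V} q_i d_i}{\sqrt{\sum_{i=1}^{V} q_i^2}\,\sqrt{\sum_{i=1}^{V} d_i^2}} \]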



– Rank documents with respect to the query by score
– The tf-idf weight increases with the number of occurrences
  of the term within a document
– The tf-idf weight increases with the rarity of the term in the collection
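
A minimal sketch of cosine scoring and ranking over dense vectors. The vectors are assumed to hold tf-idf weights already; the document names and values are hypothetical.

```python
import math


def cosine(q, d):
    """Cosine similarity: dot product divided by the product of Euclidean lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document names sorted by decreasing cosine score against the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Hypothetical tf-idf vectors over a three-term vocabulary.
docs = {"d1": [1.0, 0.0, 0.0], "d2": [1.0, 1.0, 0.0], "d3": [0.0, 0.0, 2.0]}
query = [1.0, 1.0, 0.0]
print(rank(query, docs))
```

Because cosine normalizes by vector length, a long document does not win merely by containing more terms.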
.


                       Heuristic methods
                                                 for
    Speeding up Vector Space Scoring & Ranking
    –    Many of these heuristics achieve their speed at the risk
         of not finding exactly the top K documents matching the query
•    Efficient Scoring & ranking
    1.   Inexact top K document retrieval
    2.   Index Elimination
    3.   Champion lists
    4.   Static quality scores
         •   We want top-ranking documents to be both relevant and
             authoritative
         •   Relevance is being modeled by cosine scores
         •   Authority is typically a query-independent property of a
             document
         •   Assign a query-independent quality score in [0,1] to each
             document d, Denote this by g(d)

.


                    Heuristic methods
                                               for
 Speeding up Vector Space Scoring & Ranking (Cont.)

4 - Static quality scores (cont.): net score for a document d
    •   The net score can be computed as a combination of cosine
        relevance and authority, e.g. net-score(q,d) = g(d) +
        cosine(q,d)
    •   Top K by net score: fast methods
5 - Cluster pruning: preprocessing
    •   Pick √N docs at random: call these leaders
    •   For every other doc, pre-compute its nearest leader
        –     Docs attached to a leader: its followers;
        –     Likely: each leader has ~ √N followers.
    •   Process a query as follows:
        –     Given query Q, find its nearest leader L.
        –     Seek the K nearest docs from among L’s followers
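
The cluster pruning steps on this slide can be sketched as follows. Cosine similarity is used as the nearness measure, each leader is also kept in its own candidate list, and the vectors and random seed are hypothetical; these choices are assumptions of this sketch.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preprocess(docs, rng):
    """Pick about sqrt(N) random leaders and attach every other doc to its nearest leader."""
    n_leaders = max(1, math.isqrt(len(docs)))
    leaders = rng.sample(range(len(docs)), n_leaders)
    followers = {leader: [leader] for leader in leaders}
    for i, vec in enumerate(docs):
        if i not in followers:
            nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
            followers[nearest].append(i)
    return followers


def search(query, docs, followers, k):
    """Find the nearest leader, then rank only that leader's candidate docs."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    candidates = followers[leader]
    return sorted(candidates, key=lambda i: cosine(query, docs[i]), reverse=True)[:k]


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
followers = preprocess(docs, random.Random(0))
print(search([1.0, 0.2], docs, followers, k=2))
```

Only one cluster is scored per query, which is where the speed comes from and also why the true top K may be missed.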
.




         Cluster Pruning



[Figure: documents in vector space, showing the query, the leaders, and their followers.]
.




      Parametric and zone indexes
• In fact documents have multiple parts, some with
  special semantics:
        – Author, Title, Date of publication, Language, Format, etc.
     • These constitute the metadata about a document
     • We sometimes wish to search by these metadata
     • Field or parametric index: postings for each field value
        – Field query typically treated as conjunction
     • A zone is a region of the doc that can contain an
       arbitrary amount of text e.g., Title, Abstract,
       References …
         – Build inverted indexes on zones as well to permit
           querying


.




     Example zone indexes




Encode zones in dictionary vs. postings.


.




                  Tiered indexes
  • Break postings up into a hierarchy of lists
     – Most important
     – …
     – Least important
  • Can be done by g(d) or another measure
  • Inverted index thus broken up into tiers of decreasing
    importance
  • At query time use the top tier unless it fails to yield K docs
      – If so drop to lower tiers
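
A minimal sketch of the query-time fallback described above, for a single-term query. Each tier is modelled as a dictionary from term to postings, ordered from most to least important; the data and the function name are hypothetical.

```python
def tiered_search(tiers, term, k):
    """Collect docs tier by tier, stopping as soon as K docs have been found."""
    results = []
    for tier in tiers:  # tiers are ordered from most to least important
        for doc_id in tier.get(term, []):
            if doc_id not in results:
                results.append(doc_id)
        if len(results) >= k:
            break  # the higher tiers already yielded K docs
    return results[:k]


# Hypothetical three-tier index for the term "car".
tiers = [{"car": [3]}, {"car": [1, 7]}, {"car": [2]}]
print(tiered_search(tiers, "car", 2))
```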




.




 Example tiered index




.




A Complete Search System




.




          Evaluating
         Search Engine
               (Ranked Retrieval Method)




.




 Measures for a search engine
Which measures matter most for a search engine (SE)?

  – How fast does a search engine index
  – How fast does a search engine search
  – Expressiveness of query language
  – Uncluttered User Interface(UI)
  – Is it free?



.


                         The key measure
                    User happiness
• Useless answers won’t make a user happy
• Need a way of quantifying user happiness
• Issue: who is the user we are trying to make happy?
  – Web engine
  – eCommerce site
  – Enterprise (company/govt/academic)
• Happiness: elusive to measure




.




Evaluation of unranked retrieval
  – Precision: fraction of retrieved docs that are relevant =
     P( relevant | retrieved )
  – Recall: fraction of relevant docs that are retrieved =
              P( retrieved | relevant )

                             Relevant                       Nonrelevant
     Retrieved               tp                             fp
     Not Retrieved           fn                             tn

                • Precision P = tp/(tp + fp)
                • Recall    R = tp/(tp + fn)
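
A short worked example of the two formulas, using made-up counts: 50 docs retrieved, of which 40 are relevant, and 60 relevant docs missed.

```python
def precision(tp, fp):
    """Fraction of retrieved docs that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of relevant docs that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical counts: tp = 40, fp = 10, fn = 60.
print(precision(40, 10), recall(40, 60))
```

Here precision is 40/50 = 0.8 and recall is 40/100 = 0.4.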




.




Evaluation of unranked retrieval                                       (Cont.)


• What about Accuracy
  – The accuracy of an engine: the fraction of
    classifications that are correct
  – Accuracy is a commonly used measure in machine learning
    classification work
  – Why is this not a very useful evaluation measure
    in IR?
  – How to build a 99.9999% accurate search engine
    on a low budget….
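
A small worked example of why accuracy misleads in IR, with made-up numbers: in a collection of 1,000,000 docs with a single relevant one, an engine that retrieves nothing classifies all but one document correctly.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# An engine that returns nothing: the one relevant doc is a false negative,
# and the 999,999 nonrelevant docs are true negatives.
print(accuracy(tp=0, fp=0, fn=1, tn=999_999))
```

The accuracy is 99.9999% although recall is 0, because nonrelevant documents vastly outnumber relevant ones.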




.




 Evaluation of unranked retrieval                                         (Cont.)



• F measure
  – Combined measure that assesses precision/recall
    tradeoff is F measure (weighted harmonic mean):

     \[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \]

  – People usually use balanced F1 measure i.e., with β = 1
    or α = ½
  – For F1 the best value is 1 and the worst value is 0
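
A short sketch of the F measure in its β form, with a hypothetical precision of 0.8 and recall of 0.4; β = 1 gives the balanced F1.

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall: (b^2 + 1)PR / (b^2 P + R)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# F1 for P = 0.8, R = 0.4 is 2 * 0.32 / 1.2, roughly 0.533.
print(f_measure(0.8, 0.4))
```

The harmonic mean stays close to the smaller of the two values, so a system cannot score well by maximizing only precision or only recall.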

.




Evaluation of Ranked Retrieval
• By taking various numbers of the top returned
  documents (levels of recall), the evaluator can produce
  a precision-recall curve
• We can determine a value between the points using
  Interpolation
• 11-point interpolated average precision
• Other methods: Mean average precision (MAP) and R-
  precision
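
A minimal sketch of these ranked-retrieval measures for a single query. Interpolated precision at recall level r is taken as the highest precision observed at any recall ≥ r; the ranking and the relevance judgments are hypothetical. MAP is the mean of `average_precision` over a set of queries.

```python
def pr_points(ranking, relevant):
    """(recall, precision) after each of the top-k cutoffs, k = 1..len(ranking)."""
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def interpolated_precision(points, r):
    """Highest precision at any recall level >= r (0 if that recall is never reached)."""
    return max((p for rec, p in points if rec >= r), default=0.0)


def eleven_point(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, i / 10) for i in range(11)]


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant doc retrieved."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}  # d6 is relevant but never retrieved
points = pr_points(ranking, relevant)
print(eleven_point(points))
print(average_precision(ranking, relevant))
```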




.




A precision-recall curve
[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0).]
.




    Typical (good) 11 point precisions
• SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point interpolated precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1).]
.




Relevance Feedback (RF)
                                  for

        Query Refinement
                                 In

             Search Engine



.




               Relevance Feedback
• user feedback on relevance of docs in initial set of
  results
   – User issues a (short, simple) query
   – The user marks some results as relevant or non-relevant.
   – The system computes a better representation of the
     information need based on feedback.
   – Relevance feedback can go through one or more
     iterations.

• Idea: it may be difficult to formulate a good query when you
  don’t know the collection well, so iterate


.




    Relevance Feedback: Example
• Image search engine
 http://nayana.ece.ucsb.edu/imsearch/imsearch.html




.




 Results for Initial Query




.




    Relevance Feedback




.




Results after Relevance Feedback




.




          Key concept: Centroid
• The centroid is the center of mass of a set of
  points
• Recall that we represent documents as points in
  a high-dimensional space
• Definition: Centroid
                            1       
                  µ (C ) =       ∑d
                           | C | d∈C
where C is a set of documents.


.




                Rocchio Algorithm
• The Rocchio algorithm uses the vector space model to
  choose a query that incorporates relevance feedback
• Rocchio seeks the query q_opt that maximizes
     \[ \vec{q}_{opt} = \arg\max_{\vec{q}} \, [\cos(\vec{q}, \vec{\mu}(C_r)) - \cos(\vec{q}, \vec{\mu}(C_{nr}))] \]
• Tries to separate docs marked relevant and non-relevant
     \[ \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j \]
• Problem: we don’t know the truly relevant docs

.




  Rocchio 1971 Algorithm (SMART)
• Used in practice:
     \[ \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \]
• Dr = set of known relevant doc vectors
• Dnr = set of known irrelevant doc vectors
   – Different from Cr and Cnr!
• qm = modified query vector; q0 = original query
  vector; α, β, γ: weights (hand-chosen or set
  empirically)
• New query moves toward relevant documents and
  away from irrelevant documents
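
A minimal sketch of the Rocchio update on dense vectors in plain Python. The weights and vectors in the example are made up for illustration, and the helper names are hypothetical.

```python
def centroid(vectors, dim):
    """Component-wise mean of a set of vectors (the zero vector if the set is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q0, relevant, nonrelevant, alpha, beta, gamma):
    """q_m = alpha * q0 + beta * centroid(D_r) - gamma * centroid(D_nr)."""
    dim = len(q0)
    mu_r = centroid(relevant, dim)
    mu_nr = centroid(nonrelevant, dim)
    return [alpha * q0[i] + beta * mu_r[i] - gamma * mu_nr[i] for i in range(dim)]


# Hypothetical two-term example: feedback pulls the query toward the second term.
q_m = rocchio(q0=[1.0, 0.0],
              relevant=[[0.0, 1.0], [0.0, 3.0]],
              nonrelevant=[[2.0, 0.0]],
              alpha=1.0, beta=0.5, gamma=0.25)
print(q_m)
```

The relevant centroid is (0, 2) and the irrelevant centroid is (2, 0), so the modified query is (1 − 0.5, 0 + 1) = (0.5, 1.0).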
.




     The Theoretically Best Query

[Figure: scatter plot of documents in vector space; x = non-relevant documents, o = relevant documents, ∆ = optimal query.]
.




      Relevance feedback on initial query
[Figure: scatter plot of documents in vector space; x = known non-relevant documents, o = known relevant documents, ∆ = initial query and revised query.]
.




  Relevance Feedback in vector spaces

• We can modify the query based on relevance
  feedback and apply standard vector space model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and
  precision
• Relevance feedback is most useful for increasing
  recall in situations where recall is important
   – Users can be expected to review results and to
     take time to iterate


.




     Relevance feedback revisited
• In relevance feedback, the user marks a number of
  documents as relevant/nonrelevant.
• We then try to use this information to return better
  search results.
• Suppose we just tried to learn a filter for nonrelevant
  documents
• This is an instance of a text classification problem:
   – Two “classes”: relevant, nonrelevant
   – For each document, decide whether it is relevant or
     nonrelevant



.




        Text Classification




.


                  Classification Methods #1

          Manual classification
• Used by Yahoo! (originally; now present but
  downplayed), Looksmart, about.com, ODP,
  PubMed
• Very accurate when job is done by experts
• Consistent when the problem size and team are
  small
• Difficult and expensive to scale




.


                   Classification Methods #2
   Automatic document classification
• Hand-coded rule-based systems
  – One technique used by CS dept’s spam filter,
    Reuters, CIA, etc.
  – Companies (Verity) provide “IDE” for writing such
    rules
  – Accuracy is often very high if a rule has been carefully
    refined over time by a subject expert
  – Building and maintaining these rules is expensive




.


                     Classification Methods #3

           Supervised learning
• Supervised learning of a document-label
  assignment function
  – Many systems partly rely on machine learning
     • k-Nearest Neighbors (simple, powerful)
     • Naive Bayes (simple, common method)
     • Support-vector machines (new, more powerful)
     • No free lunch: requires hand-classified training data
     • But data can be built up (and refined) by amateurs




.




                         References
• Introduction to Information Retrieval-2008
• Managing Gigabytes-1999





More Related Content

What's hot

Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Ppt evaluation of information retrieval system
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval systemsilambu111
 
Information retrieval (introduction)
Information  retrieval (introduction) Information  retrieval (introduction)
Information retrieval, By Hadi Mohammadzadeh

  • 1. Seminar on Information Retrieval (IR). By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 3 Nov. 2009. Hadi Mohammadzadeh, Information Retrieval (IR), 50 Pages
  • 2. Information Retrieval Definition
      • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (query) from within large collections (usually stored on computers).
  • 3. Basic assumptions of Information Retrieval
      • Collection: a fixed set of documents
      • Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task
  • 4. Search Methods for Finding Documents
  • 5. Searching Methods
      • Grep method
      • Term-document incidence matrix (binary retrieval)
      • Inverted index
      • Inverted index with skip pointers/skip lists
      • Positional postings (for phrase queries)
  • 6. Term-document incidence (1 if the play contains the word, 0 otherwise)

      Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
      Antony     1                     1              0            0       0        1
      Brutus     1                     1              0            1       0        0
      Caesar     1                     1              0            1       1        1
      Calpurnia  0                     1              0            0       0        0
      Cleopatra  1                     0              0            0       0        0
      mercy      1                     0              1            1       1        1
      worser     1                     0              1            1       1        0
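A Boolean query over the incidence matrix can be answered with bitwise operations on the term rows. The sketch below is only an illustration: the query (Brutus AND Caesar AND NOT Calpurnia) and the helper names `to_bits`, `plays_for`, and `boolean_and_not` are assumptions, not part of the slides. The 0/1 rows are copied from the table above.

```python
# Rows of the term-document incidence matrix, copied from the slide.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}
MASK = (1 << len(PLAYS)) - 1  # keeps NOT within the six document bits


def to_bits(row):
    """Pack a 0/1 row into an integer, leftmost play as the highest bit."""
    bits = 0
    for flag in row:
        bits = (bits << 1) | flag
    return bits


def plays_for(bits):
    """Translate a result bit vector back into play titles."""
    last = len(PLAYS) - 1
    return [play for i, play in enumerate(PLAYS) if (bits >> (last - i)) & 1]


def boolean_and_not(term_a, term_b, excluded):
    """Evaluate term_a AND term_b AND NOT excluded (hypothetical helper)."""
    a = to_bits(INCIDENCE[term_a])
    b = to_bits(INCIDENCE[term_b])
    c = to_bits(INCIDENCE[excluded])
    return plays_for(a & b & (~c & MASK))


print(boolean_and_not("Brutus", "Caesar", "Calpurnia"))
```

The matrix has one bit per term-document pair, so it grows quickly with collection size, which is one motivation for the inverted index on the next slides.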
  • 7. Sec. 1.2: Inverted index
      • For each term T, we must store a list of all documents that contain T.
      • Do we use an array or a list for this?
          Brutus    → 2 4 8 16 32 64 128
          Calpurnia → 1 2 3 5 8 13 21 34
          Caesar    → 13 16
      • What happens if the word Caesar is added to document 14?
  • 8. Sec. 1.2: Inverted index
      • Linked lists are generally preferred to arrays
          – Dynamic space allocation
          – Insertion of terms into documents is easy
          – Space overhead of pointers
      • Dictionary → postings lists (each entry in a list is a posting):
          Brutus    → 2 4 8 16 32 64 128
          Calpurnia → 1 2 3 5 8 13 21 34
          Caesar    → 13 16
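To make the "Caesar is added to document 14" question concrete, here is a minimal sketch that keeps each postings list sorted by document ID. It uses Python lists (arrays) rather than the linked lists the slide prefers, so an insertion in the middle shifts later entries; that shifting is the cost a linked list avoids. The function name `add_posting` is an assumption made for illustration.

```python
import bisect

# Dictionary -> postings lists, with the document IDs shown on the slide.
postings = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and unique."""
    plist = index.setdefault(term, [])
    pos = bisect.bisect_left(plist, doc_id)
    if pos == len(plist) or plist[pos] != doc_id:
        plist.insert(pos, doc_id)  # array insert: shifts every later entry


add_posting(postings, "Caesar", 14)
print(postings["Caesar"])
```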
  • 9. Augment postings with skip pointers (at indexing time)
          2 4 8 41 48 64 128     (skip pointers to 41 and 128)
          1 2 3 8 11 17 21 31    (skip pointers to 11 and 31)
      • Why? To skip postings that will not figure in the search results.
      • Where do we place skip pointers?
  • 10. Where do we place skips?
      • Tradeoff:
          – More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons to skip pointers.
          – Fewer skips → fewer pointer comparisons, but long skip spans ⇒ few successful skips.
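The following is a rough sketch of intersecting two sorted postings lists with skip pointers. It assumes evenly spaced skips of about √L positions for a list of length L; that spacing is a common heuristic and is not stated on the slides. Skips are modeled as index jumps in a Python list, and the names `skip_step` and `intersect_with_skips` are hypothetical.

```python
import math


def skip_step(postings):
    """Assumed skip spacing: roughly sqrt(L) for a list of length L."""
    return max(1, int(math.sqrt(len(postings))))


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following a skip when it is safe."""
    s1, s2 = skip_step(p1), skip_step(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer lives at every s1-th position; take it only if
            # its target does not overshoot the other list's current doc ID.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))
```

Each skip check costs one extra comparison, which is the tradeoff described above: many short skips mean many checks, while few long skips rarely succeed.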
  • 11. Positional index example
          <be: 993427;
            1: 7, 18, 33, 72, 86, 231;
            2: 3, 149;
            4: 17, 191, 291, 430, 434;
            5: 363, 367, …>
      • Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
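One way to approach the question using only the postings for "be": in the phrase "to be or not to be" the two occurrences of "be" are four positions apart, so a candidate document must contain two "be" positions that differ by 4. This is a necessary condition only; a full phrase check would also need the positional postings for "to", "or", and "not", which the slide does not show. The function name `docs_with_gap` is an assumption.

```python
# Positional postings for "be", copied from the slide.
# Doc 5's list is truncated ("…") in the source, so only the shown positions are used.
BE_POSITIONS = {
    1: [7, 18, 33, 72, 86, 231],
    2: [3, 149],
    4: [17, 191, 291, 430, 434],
    5: [363, 367],
}


def docs_with_gap(positions_by_doc, gap):
    """Return docs in which two occurrences of the term lie exactly `gap` apart."""
    hits = []
    for doc_id, positions in positions_by_doc.items():
        seen = set(positions)
        if any(p + gap in seen for p in positions):
            hits.append(doc_id)
    return hits


print(docs_with_gap(BE_POSITIONS, 4))
```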
  • 12. Sec. 1.2: Steps of inverted index construction
      1. Documents to be indexed: Friends, Romans, countrymen.
      2. Tokenizer → token stream: Friends  Romans  Countrymen
      3. Linguistic modules → modified tokens: friend  roman  countryman
      4. Indexer → inverted index:
          friend     → 2 4
          roman      → 1 2
          countryman → 13 16
  • 13. Parts of an Inverted Index
      • Dictionary: commonly kept in memory
      • Postings lists: commonly kept on disk
  • 14. Inverted index construction: preprocessing to form the term vocabulary
      • Tokenization (problems): hyphens, apostrophes, compounds, Chinese, numbers
      • Dropping stop words
          – But you need them for phrase queries, various song titles, and relational queries
      • Normalization (term equivalence classing)
          – Numbers
          – Case folding (reduce all letters to lower case)
          – Stemming (Porter's algorithm): reduce terms to their “roots”
          – Lemmatization (reduce variant forms to base form)
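A small sketch of tokenization, case folding, and stop-word removal may help show why dropping stop words hurts phrase queries. The regular expression and the tiny stop list are assumptions chosen for illustration; stemming and lemmatization are left out because they need a real algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "of", "the", "to", "be"}


def tokenize(text):
    """Split text into tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


def normalize(tokens, drop_stop_words=True):
    """Case-fold tokens and optionally drop stop words."""
    terms = [token.lower() for token in tokens]
    if drop_stop_words:
        terms = [term for term in terms if term not in STOP_WORDS]
    return terms


print(normalize(tokenize("Friends, Romans, countrymen.")))
print(normalize(tokenize("To be or not to be")))
```

With this stop list the second call keeps only "or" and "not", so the phrase can no longer be matched as written.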
  • 15. Inverted index construction: index construction
      • Blocked sort-based indexing (BSBI)
          – Algorithm: accumulate postings for each block, sort, and write to disk
          – Then merge the blocks (external sorting) into one long sorted order
      • Distributed indexing using MapReduce
          – Break up indexing into sets of 2 parallel tasks: parsers and inverters
          – Break the input document corpus into splits
          – Parsers
              · The master assigns a split to an idle parser machine
              · The parser reads one document at a time and emits (term, doc) pairs
              · The parser writes pairs into j partitions
              · Each partition is for a range of terms' first letters
          – Inverters
              · An inverter collects all (term, doc) pairs for one term partition
              · It sorts them and writes them to postings lists
      • Dynamic indexing
  • 16. Inverted index construction: data flow
      [Figure: the master assigns splits to parsers (map phase); each parser writes segment files partitioned a-f, g-p, q-z; the master assigns each partition to an inverter (reduce phase), which writes the postings.]
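The parser and inverter roles can be imitated in a single process. The sketch below is not a distributed implementation: there is no master or scheduling, and the sample documents and the names `parse`, `invert`, and `partition_of` are made up for illustration. Only the three first-letter partitions (a-f, g-p, q-z) come from the data-flow slide.

```python
from collections import defaultdict

# First-letter ranges of the term partitions, as in the data-flow figure.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Pick the partition index from the term's first letter."""
    first = term[0]
    for idx, (low, high) in enumerate(PARTITIONS):
        if low <= first <= high:
            return idx
    raise ValueError(f"no partition for term {term!r}")


def parse(split):
    """Map phase: emit (term, doc) pairs into per-partition segment files."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: sort the pairs of one partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


# Hypothetical corpus, already broken into two splits.
splits = [
    [(1, "caesar came"), (2, "brutus killed caesar")],
    [(3, "caesar was ambitious")],
]
segment_files = [parse(split) for split in splits]

index = {}
for part in range(len(PARTITIONS)):
    pairs = [pair for segments in segment_files for pair in segments[part]]
    index.update(invert(pairs))

print(index["caesar"])
```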
  • 17. Search structures for the dictionary
      • A naïve dictionary
      • Hash tables
      • Trees
          – Binary tree
          – B-tree
  • 18. Index compression
      • Dictionary compression for Boolean indexes
          – Array of fixed-width entries (wasteful)
          – Dictionary as a string
          – Blocking
          – Front coding
      • Postings compression
          – Gap encoding using prefix-unique codes
              · Variable-byte codes
              · Gamma codes (seldom used in practice)
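As an illustration of gap encoding with variable-byte codes, the sketch below stores the differences between consecutive document IDs and encodes each gap in 7-bit chunks, setting the high bit on the last byte of a gap. This byte layout is the usual textbook convention and is assumed here, since the slide does not spell it out. The function names are hypothetical.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as variable-byte code."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # high bit marks the final byte of this number
    return bytes(chunks)


def vb_encode(doc_ids):
    """Gap-encode a sorted postings list and concatenate the VB codes."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vb_encode_number(gap) for gap in gaps)


def vb_decode(data):
    """Decode VB bytes back into the original document IDs."""
    doc_ids, n, previous = [], 0, 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            previous += n  # undo the gap encoding
            doc_ids.append(previous)
            n = 0
    return doc_ids


brutus = [2, 4, 8, 16, 32, 64, 128]
encoded = vb_encode(brutus)
print(len(encoded), vb_decode(encoded))
```

Because every gap in the Brutus list is below 128, each posting fits in a single byte here.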
  • 19. Dictionary compression for Boolean indexes: dictionary as a string
          ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
      • Table columns: Freq. (33, 29, 44, 126), Postings ptr., Term ptr.
      • Total string length = 400K × 8 B = 3.2 MB
      • Pointers resolve 3.2M positions: $\log_2 3.2\text{M} \approx 22$ bits = 3 bytes
  • 20. Dictionary compression for Boolean indexes: blocking
          ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
      • Table columns: Freq. (33, 29, 44, 126, 7), Postings ptr., Term ptr.
      • Save 9 bytes on 3 pointers.
      • Lose 4 bytes on term lengths.
  • 21. Dictionary compression for Boolean indexes: front coding
      • Sorted words commonly have a long common prefix: store differences only (for the last k−1 terms in a block of k)
          8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
      • “8automat*” encodes the prefix automat; the numbers that follow give the extra length beyond automat.
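The front-coding example can be reproduced with a short sketch. The helper names `front_code`, `render`, and `decode` are assumptions; the block is kept as a (prefix, suffixes) pair, and `render` only formats it in the notation used on the slide.

```python
import os


def front_code(block):
    """Split a block of sorted terms into a shared prefix and per-term suffixes."""
    prefix = os.path.commonprefix(block)
    return prefix, [term[len(prefix):] for term in block]


def render(prefix, suffixes):
    """Format the coded block in the slide's notation, e.g. 8automat*a1◊e..."""
    first, rest = suffixes[0], suffixes[1:]
    head = f"{len(prefix) + len(first)}{prefix}*{first}"
    tail = "".join(f"{len(suffix)}◊{suffix}" for suffix in rest)
    return head + tail


def decode(prefix, suffixes):
    """Rebuild the original terms from the prefix and suffixes."""
    return [prefix + suffix for suffix in suffixes]


block = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_code(block)
print(render(prefix, suffixes))
```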
  • 22. Information Retrieval: Ranked Retrieval
  • 23. Information Retrieval: Ranked retrieval
      • Thus far, our queries have all been Boolean.
          – Good for expert users
          – Also good for applications: applications can easily consume 1000s of results.
          – Not good for the majority of users.
          – Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
      • Most users don't want to wade through 1000s of results.
          – This is particularly true of web search.
  • 24. Term Weighting
      • Term frequency and inverse document frequency
          – TF: $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
          – IDF: with $\mathrm{df}_t$ the number of docs in the collection that contain term t and N the number of docs in the collection, $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
      • tf-idf weighting
          – The tf-idf weight of a term is the product of its tf weight and its idf weight: $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$
      • tf-idf is the best-known weighting scheme in information retrieval
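The weighting formulas translate directly into a few lines of code. This is a minimal sketch with hypothetical function names and made-up counts; it assumes df is at least 1 for any term that is looked up.

```python
import math


def tf_weight(tf):
    """Log-scaled term frequency: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def tf_idf(tf, n_docs, df):
    """tf-idf weight: product of the tf weight and the idf weight."""
    return tf_weight(tf) * idf(n_docs, df)


# Hypothetical numbers: a term occurring 10 times in a document and
# appearing in 1,000 of 1,000,000 documents.
print(tf_idf(10, 1_000_000, 1_000))
```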
  • 25. Vector space model for scoring
      • Represent the query as a weighted tf-idf vector
      • Represent each document as a weighted tf-idf vector
      • Compute the cosine similarity score for the query vector and each document vector:
          $\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
      • Rank documents with respect to the query by score
      • The tf-idf weight
          – increases with the number of occurrences within a document
          – increases with the rarity of the term in the collection
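Cosine scoring and ranking can be sketched as follows, assuming the query and documents are already given as equal-length weight vectors over the same vocabulary. The vectors and the names `cosine` and `rank` are illustrative assumptions.

```python
import math


def cosine(q, d):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return document names ordered by decreasing cosine score."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Hypothetical tf-idf vectors over a three-term vocabulary.
docs = {"d1": [2.0, 0.0, 1.0], "d2": [0.0, 3.0, 0.0], "d3": [1.0, 1.0, 1.0]}
print(rank([1.0, 0.0, 1.0], docs))
```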
  • 26. Heuristic methods for speeding up vector space scoring and ranking
      • Many of these heuristics achieve their speed at the risk of not finding exactly the top K documents matching the query
      • Efficient scoring and ranking
          1. Inexact top K document retrieval
          2. Index elimination
          3. Champion lists
          4. Static quality scores
              – We want top-ranking documents to be both relevant and authoritative
              – Relevance is modeled by cosine scores
              – Authority is typically a query-independent property of a document
              – Assign a query-independent quality score in [0,1] to each document d; denote this by g(d)
  • 27. Heuristic methods for speeding up vector space scoring and ranking (cont.)
          5. Cluster pruning: preprocessing
              – Pick √N docs at random: call these leaders
              – For every other doc, pre-compute its nearest leader
                  · Docs attached to a leader are its followers
                  · Likely: each leader has ~√N followers
              – Process a query as follows:
                  · Given query Q, find its nearest leader L
                  · Seek the K nearest docs from among L's followers
      • Net score for a document d
          – The net score can be computed as a combination of cosine relevance and authority, e.g. net-score(q,d) = g(d) + cosine(q,d)
          – Top K by net score: fast methods
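Cluster pruning can be sketched in a few functions. This is an illustration under several assumptions: documents are small dense vectors, similarity is cosine, a leader is counted among its own followers, and the names `build_clusters` and `search` are hypothetical. Because leaders are chosen at random, the result is approximate and may miss true top-K documents.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_clusters(docs, rng):
    """Preprocessing: pick about sqrt(N) random leaders and attach each doc to its nearest one."""
    n_leaders = max(1, int(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    followers = {leader: [leader] for leader in leaders}
    for doc_id in docs:
        if doc_id in followers:
            continue
        nearest = max(leaders, key=lambda lead: cosine(docs[doc_id], docs[lead]))
        followers[nearest].append(doc_id)
    return followers


def search(query, docs, followers, k):
    """Query time: find the nearest leader, then rank only its followers."""
    leader = max(followers, key=lambda lead: cosine(query, docs[lead]))
    candidates = followers[leader]
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]


# Hypothetical two-dimensional document vectors.
docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.1, 0.9]}
followers = build_clusters(docs, random.Random(0))
print(search([1.0, 0.2], docs, followers, k=2))
```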
  • 28. Cluster Pruning
      [Figure: a query, the leaders, and their followers]
  • 29. Parametric and zone indexes
      • Documents have multiple parts, some with special semantics:
          – Author, title, date of publication, language, format, etc.
      • These constitute the metadata about a document
      • We sometimes wish to search by these metadata
      • Field or parametric index: postings for each field value
          – A field query is typically treated as a conjunction
      • A zone is a region of the doc that can contain an arbitrary amount of text, e.g., title, abstract, references …
          – Build inverted indexes on zones as well to permit querying
  • 30. Example zone indexes: encode zones in the dictionary vs. in the postings.
  • 31. Tiered indexes
      • Break postings up into a hierarchy of lists
          – Most important
          – …
          – Least important
      • Can be done by g(d) or another measure
      • The inverted index is thus broken up into tiers of decreasing importance
      • At query time, use the top tier unless it fails to yield K docs
          – If so, drop to lower tiers
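The query-time rule for tiered indexes (use the top tier, fall through to lower tiers only when fewer than K documents are found) can be sketched for a single-term query. The tier contents and the name `tiered_lookup` are made up for illustration.

```python
def tiered_lookup(tiers, term, k):
    """Collect up to k docs for a term, visiting tiers from most to least important."""
    results = []
    for tier in tiers:
        for doc_id in tier.get(term, []):
            if doc_id not in results:
                results.append(doc_id)
        if len(results) >= k:
            break  # the tiers visited so far already yield K docs
    return results[:k]


# Hypothetical tiers: tier 0 holds the most important postings for each term.
tiers = [
    {"car": [1, 2]},
    {"car": [3, 4, 5], "insurance": [2]},
]
print(tiered_lookup(tiers, "car", 2))
print(tiered_lookup(tiers, "car", 4))
```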
  • 32. Example tiered index
  • 33. A Complete Search System
  • 34. Evaluating Search Engine (Ranked Retrieval Method)
  • 35. . Measures for a search engine • Which parameters matter most in a search engine (SE)? – How fast does it index? – How fast does it search? – Expressiveness of the query language – Uncluttered user interface (UI) – Is it free?
  • 36. . The key measure: user happiness • Useless answers won’t make a user happy • Need a way of quantifying user happiness • Issue: who is the user we are trying to make happy? – Web engine – eCommerce site – Enterprise (company/govt/academic) • Happiness: elusive to measure
  • 37. . Evaluation of unranked retrieval – Precision: fraction of retrieved docs that are relevant = P( relevant | retrieved ) – Recall: fraction of relevant docs that are retrieved = P( retrieved | relevant ) • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn) • Contingency table:

                    Relevant     Nonrelevant
 Retrieved             tp             fp
 Not Retrieved         fn             tn
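As a small worked illustration of the two definitions, the following Python sketch computes precision and recall from a retrieved set and a relevant set. The function name `precision_recall` and the use of plain doc IDs are assumptions for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant docs that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving docs 1 to 4 when docs 2, 4 and 6 are relevant gives tp = 2, so precision is 2/4 and recall is 2/3.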
  • 38. . Evaluation of unranked retrieval (Cont.) • What about accuracy? – The accuracy of an engine: the fraction of classifications that are correct, i.e. (tp + tn)/(tp + fp + fn + tn) – Accuracy is a commonly used measure in machine learning classification work – Why is this not a very useful evaluation measure in IR? – How to build a 99.9999% accurate search engine on a low budget: return nothing for every query, since almost all documents are nonrelevant to any given query
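The accuracy argument can be checked with a few lines of arithmetic. The collection size (10,000,000 docs) and the number of relevant docs (10) below are made-up figures chosen only to reproduce the 99.9999% on the slide.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical engine that returns nothing for every query:
# all 10 relevant docs are missed (fn), and the remaining
# 9,999,990 nonrelevant docs are correctly not retrieved (tn).
acc = accuracy(tp=0, fp=0, fn=10, tn=9_999_990)
recall = 0 / (0 + 10)  # tp / (tp + fn)
```

Here `acc` comes out at 0.999999 while `recall` is 0, which is why precision and recall, not accuracy, are used to evaluate retrieval.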
  • 39. . Evaluation of unranked retrieval (Cont.) • F measure – A combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, where $\beta^2 = (1 - \alpha)/\alpha$ – People usually use the balanced F1 measure, i.e., with β = 1 or α = ½ – For F1 the best value is 1 and the worst value is 0
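The F measure formula translates directly into code. In this sketch the function name `f_measure` is hypothetical, and returning 0 when either precision or recall is 0 is a convention assumed here to avoid dividing by zero.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1; beta > 1 weights recall more
    heavily, beta < 1 weights precision more heavily.
    """
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```

With P = 0.5 and R = 1.0, F1 is 2/3, which is lower than the arithmetic mean of 0.75: the harmonic mean penalises an imbalance between the two values.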
  • 40. . Evaluation of Ranked Retrieval • By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve • We can determine a value between the points using interpolation • 11-point interpolated average precision • Other methods: mean average precision (MAP) and R-precision
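The ranked-retrieval measures can be sketched in Python as follows. The sketch assumes the usual definition of interpolated precision (the highest precision found at any recall level at or above the requested one) and a ranking given as a list of doc IDs; all function names are hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) after each position of the ranking."""
    relevant = set(relevant)
    if not relevant:
        return []
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


def eleven_point_average(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranking, relevant)
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, lv) for lv in levels) / 11


def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant doc's rank."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

MAP is then the mean of `average_precision` over a set of queries.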
  • 41. . A precision-recall curve [Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]
  • 42. . Typical (good) 11 point precisions • SabIR/Cornell 8A1 11pt precision from TREC 8 (1999) [Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1)]
  • 43. . Relevance Feedback (RF) for Query Refinement In Search Engine
  • 44. . Relevance Feedback • User feedback on the relevance of docs in the initial set of results – The user issues a (short, simple) query – The user marks some results as relevant or non-relevant – The system computes a better representation of the information need based on the feedback – Relevance feedback can go through one or more iterations • Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate
  • 45. . Relevance Feedback: Example • Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
  • 46. . Results for Initial Query
  • 47. . Relevance Feedback
  • 48. . Results after Relevance Feedback
  • 49. . Key concept: Centroid • The centroid is the center of mass of a set of points • Recall that we represent documents as points in a high-dimensional space • Definition: Centroid $\vec{\mu}(C) = \frac{1}{|C|} \sum_{\vec{d} \in C} \vec{d}$ where C is a set of documents.
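The centroid definition amounts to a component-wise average of the document vectors. A minimal sketch, assuming documents are equal-length lists of term weights and using the hypothetical name `centroid`:

```python
def centroid(docs):
    """Component-wise mean of a non-empty list of document vectors."""
    n = len(docs)
    dims = len(docs[0])
    return [sum(doc[i] for doc in docs) / n for i in range(dims)]
```

For the three vectors (1, 0), (0, 1) and (1, 1), each component sums to 2, so the centroid is (2/3, 2/3).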
  • 50. . Rocchio Algorithm • The Rocchio algorithm uses the vector space model to pick a relevance fed-back query • Rocchio seeks the optimal query $\vec{q}_{opt} = \arg\max_{\vec{q}} [\cos(\vec{q}, \vec{\mu}(C_r)) - \cos(\vec{q}, \vec{\mu}(C_{nr}))]$ • Tries to separate docs marked relevant and non-relevant: $\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$ • Problem: we don’t know the truly relevant docs
  • 51. . Rocchio 1971 Algorithm (SMART) • Used in practice: $\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$ • Dr = set of known relevant doc vectors • Dnr = set of known irrelevant doc vectors – Different from Cr and Cnr! • qm = modified query vector; q0 = original query vector; α, β, γ: weights (hand-chosen or set empirically) • New query moves toward relevant documents and away from irrelevant documents
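The Rocchio 1971 update can be sketched in a few lines of Python. The function name `rocchio` and the default weights are illustrative assumptions, since the slide only says the weights are hand-chosen or set empirically. Clipping negative term weights to zero is a common convention that the slide does not mention; it is marked as an assumption in the code.

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr).

    q0, and every vector in relevant / nonrelevant, is a list of
    term weights of the same length. Default weights are illustrative.
    """
    dims = len(q0)

    def mean(vectors):
        # Centroid of the marked docs; zero vector if none were marked.
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    mu_r, mu_nr = mean(relevant), mean(nonrelevant)
    q_m = [alpha * q0[i] + beta * mu_r[i] - gamma * mu_nr[i] for i in range(dims)]
    # Assumption (not on the slide): negative weights are set to 0.
    return [max(0.0, w) for w in q_m]
```

Only documents the user actually marked enter the two centroids, which is the difference between Dr, Dnr and the unknown true sets Cr, Cnr.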
  • 52. . The Theoretically Best Query [Figure: x = non-relevant documents, o = relevant documents, ∆ = optimal query]
  • 53. . Relevance feedback on initial query [Figure: x = known non-relevant documents, o = known relevant documents, ∆ = initial query and revised query]
  • 54. . Relevance Feedback in vector spaces • We can modify the query based on relevance feedback and apply standard vector space model. • Use only the docs that were marked. • Relevance feedback can improve recall and precision • Relevance feedback is most useful for increasing recall in situations where recall is important – Users can be expected to review results and to take time to iterate
  • 55. . Relevance feedback revisited • In relevance feedback, the user marks a number of documents as relevant/nonrelevant. • We then try to use this information to return better search results. • Suppose we just tried to learn a filter for nonrelevant documents • This is an instance of a text classification problem: – Two “classes”: relevant, nonrelevant – For each document, decide whether it is relevant or nonrelevant
  • 56. . Text Classification
  • 57. . Classification Methods #1 Manual classification • Used by Yahoo! (originally; now present but downplayed), Looksmart, about.com, ODP, PubMed • Very accurate when the job is done by experts • Consistent when the problem size and team are small • Difficult and expensive to scale
  • 58. . Classification Methods #2 Automatic document classification • Hand-coded rule-based systems – One technique used by CS dept’s spam filter, Reuters, CIA, etc. – Companies (Verity) provide “IDE” for writing such rules – Accuracy is often very high if a rule has been carefully refined over time by a subject expert – Building and maintaining these rules is expensive
  • 59. . Classification Methods #3 Supervised learning • Supervised learning of a document-label assignment function – Many systems partly rely on machine learning • k-Nearest Neighbors (simple, powerful) • Naive Bayes (simple, common method) • Support-vector machines (new, more powerful) • No free lunch: requires hand-classified training data • But data can be built up (and refined) by amateurs
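Of the supervised methods listed, k-Nearest Neighbors is the shortest to illustrate. The sketch below is a minimal version under stated assumptions: documents are dense term-weight vectors, similarity is cosine, and the label is decided by majority vote among the k most similar training docs. The name `knn_classify` and the training data layout are hypothetical.

```python
import math
from collections import Counter


def knn_classify(doc, training, k=3):
    """Label doc by majority vote of its k most similar training docs.

    training: list of (vector, label) pairs, hand-classified.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    nearest = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The hand-classified `training` list is the cost the slide refers to as "no free lunch": the classifier is only as good as the labelled examples it is given.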
  • 60. . References • Introduction to Information Retrieval (2008) • Managing Gigabytes (1999)

Editor's Notes

  1. SMART: Cornell (Salton) IR system of 1970s to 1990s.
  2. Just as we modified the query in the vector space model, we can also modify it here. I’m not aware of work that uses language-model-based IR this way.