1. .
Seminar on
Information Retrieval (IR)
By : Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 3 Nov. 2009
Hadi Mohammadzadeh Information Retrieval (IR) 50 Pages 1
2. .
Information Retrieval Definition
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (query) from within large collections (usually stored on computers).
3. .
Basic assumptions of Information Retrieval
• Collection: Fixed set of documents
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
4. .
Search Methods for Finding Documents
5. .
Searching Methods
Grep method
Term-document incidence matrix (binary retrieval)
Inverted index
Inverted index with skip pointers/skip lists
Positional postings (for phrase queries)
6. .
Term-document incidence
Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 1 | 1 | 0 | 0 | 0 | 1
Brutus | 1 | 1 | 0 | 1 | 0 | 0
Caesar | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy | 1 | 0 | 1 | 1 | 1 | 1
worser | 1 | 0 | 1 | 1 | 1 | 0
An entry is 1 if the play contains the word, 0 otherwise.
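A minimal sketch of how a Boolean query could be answered from the incidence matrix: each row is stored as a bit vector and the query is evaluated with bitwise operations. The query (Brutus AND Caesar AND NOT Calpurnia) and the names `INCIDENCE` and `plays_for` are illustrative assumptions, not part of the slides.

```python
# Columns of the incidence matrix, left to right.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit vector per term; the leftmost bit is the first play.
INCIDENCE = {
    "Antony":    0b110001,
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
    "Cleopatra": 0b100000,
    "mercy":     0b101111,
    "worser":    0b101110,
}

MASK = (1 << len(PLAYS)) - 1  # keeps NOT within the six columns


def plays_for(bits):
    """Translate a result bit vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]


# Illustrative query: Brutus AND Caesar AND NOT Calpurnia
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & (~INCIDENCE["Calpurnia"] & MASK)
print(plays_for(result))
```

The matrix is mostly zeros for a realistic collection, which is the motivation for storing only the 1 entries in an inverted index.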
7. . Sec. 1.2
Inverted index
• For each term T, we must store a list of all
documents that contain T.
• Do we use an array or a list for this?
Brutus → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar → 13 16
What happens if the word Caesar is added to document 14?
8. . Sec. 1.2
Inverted index
• Linked lists are generally preferred to arrays
– Dynamic space allocation
– Insertion of terms into documents is easy
– Cost: space overhead of pointers
Dictionary → postings lists (each docID in a list is a posting):
Brutus → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar → 13 16
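The following sketch, under the assumption that postings are kept as sorted Python lists, shows what adding the word Caesar to document 14 involves. A Python list is an array, so the insertion shifts later entries; a linked list would avoid the shift at the cost of pointer overhead. The class name `InvertedIndex` is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


class InvertedIndex:
    """Dictionary of terms, each mapped to a sorted postings list of docIDs."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, term, doc_id):
        plist = self.postings[term]
        pos = bisect_left(plist, doc_id)          # binary search for the slot
        if pos == len(plist) or plist[pos] != doc_id:
            plist.insert(pos, doc_id)             # array insert: shifts the tail


index = InvertedIndex()
for doc_id in (13, 16):
    index.add("Caesar", doc_id)

index.add("Caesar", 14)   # the word Caesar is added to document 14
print(index.postings["Caesar"])
```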
9. .
Augment postings with skip pointers (at indexing time)
2 4 8 41 48 64 128 (skip pointers to 41 and 128)
1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)
• Why?
• To skip postings that will not figure in the search results.
• Where do we place skip pointers?
10. .
Where do we place skips?
• Tradeoff:
– More skips → shorter skip spans ⇒ more likely to skip, but lots of comparisons to skip pointers.
– Fewer skips → fewer pointer comparisons, but long skip spans ⇒ few successful skips.
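A sketch of postings intersection with skip pointers. It assumes evenly spaced skips with a span of about √L for a list of length L, which is one common heuristic and not something the slides prescribe; the skip pointers are simulated by index arithmetic on a sorted list rather than stored explicitly.

```python
import math


def skip_span(length):
    """Assumed placement heuristic: one skip every ~sqrt(length) postings."""
    return max(1, int(math.sqrt(length)))


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skips when they help."""
    span_a, span_b = skip_span(len(a)), skip_span(len(b))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer lives at every multiple of span_a; follow it only
            # if its target does not overshoot the other list's current docID.
            if i % span_a == 0 and i + span_a < len(a) and a[i + span_a] <= b[j]:
                i += span_a
            else:
                i += 1
        else:
            if j % span_b == 0 and j + span_b < len(b) and b[j + span_b] <= a[i]:
                j += span_b
            else:
                j += 1
    return out


print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
```

Shrinking `skip_span` gives more skip opportunities but more comparisons, which is the tradeoff described above.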
11. .
Positional index example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
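One way to reason about the question: in “to be or not to be” the word be occurs at the second and sixth positions of the phrase, so a candidate document needs two occurrences of be exactly 4 positions apart. The sketch below checks only that necessary condition, using only the positions listed on the slide (the elided positions of doc 5 are not guessed). The function name `docs_with_gap` is hypothetical.

```python
# Positional postings for "be" as listed on the slide (doc 5 is truncated there).
BE_POSTINGS = {
    1: [7, 18, 33, 72, 86, 231],
    2: [3, 149],
    4: [17, 191, 291, 430, 434],
    5: [363, 367],
}


def docs_with_gap(postings, gap):
    """Return docIDs in which the term occurs at positions p and p + gap."""
    hits = []
    for doc_id, positions in sorted(postings.items()):
        seen = set(positions)
        if any(p + gap in seen for p in positions):
            hits.append(doc_id)
    return hits


print(docs_with_gap(BE_POSTINGS, 4))
```

A full phrase query would also intersect the positional postings of to, or and not at the matching offsets.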
12. . Sec. 1.2
Steps of Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
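The three stages can be sketched as follows. This is a toy pipeline under stated assumptions: the tokenizer keeps only runs of ASCII letters, and the linguistic module is a lowercase step plus a tiny hypothetical lookup table (`LEMMAS`) standing in for real stemming or lemmatization.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real stemmer or lemmatizer.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split the text into a stream of word tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Linguistic module: case folding, then the toy lemma lookup."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs):
    """Indexer: map each term to the sorted list of docIDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in sorted(terms):
            index[term].append(doc_id)
    return dict(index)


index = build_index({1: "Friends, Romans, countrymen.", 2: "Romans and friends"})
print(index)
```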
13. .
Parts of an Inverted Index
• Dictionary
– Commonly kept in memory
• Posting lists
– Commonly kept on disk
14. .
Inverted index construction
Preprocessing to form the term vocabulary
• Tokenization (problems)
– Hyphens
– Apostrophes
– Compounds
– Chinese
– Numbers
• Dropping stop words
– But you need them: phrase queries, various song titles, relational queries
• Normalization (term equivalence classing)
– Numbers
– Case folding (reduce all letters to lower case)
– Stemming (Porter’s algorithm): reduce terms to their “roots”
– Lemmatization (reduce variant forms to base form)
15. .
Inverted index construction
Index Construction
• Blocked sort-based indexing (BSBI)
– Algorithm: accumulate postings for each block, sort, write to disk
– Then merge (external sorting) the blocks into one long sorted order
• Distributed indexing using MapReduce
– Break up indexing into sets of two kinds of parallel tasks: parsers and inverters
– Break the input document corpus into splits
– Parsers
• The master assigns a split to an idle parser machine
• The parser reads one document at a time and emits (term, doc) pairs
• The parser writes the pairs into j partitions
• Each partition is for a range of terms’ first letters
– Inverters
• An inverter collects all (term, doc) pairs for one term partition
• It sorts them and writes them to postings lists
• Dynamic indexing
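A minimal in-memory sketch of the BSBI idea: sort each block of (term, doc) pairs, then merge the sorted blocks into one sorted stream and collect postings. In-memory lists stand in for the block files that a real implementation would write to disk, and the function name `bsbi` and its `block_size` parameter are assumptions for illustration.

```python
import heapq
from itertools import groupby


def bsbi(pairs, block_size):
    """Build term -> sorted docID list from (term, doc_id) pairs, block by block."""
    # Step 1: accumulate postings per block and sort each block.
    blocks = [sorted(pairs[start:start + block_size])
              for start in range(0, len(pairs), block_size)]

    # Step 2: merge the sorted blocks (the external-sort merge step).
    merged = heapq.merge(*blocks)

    # Step 3: group the merged stream by term, dropping duplicate docIDs.
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index


pairs = [("caesar", 1), ("brutus", 1), ("caesar", 2),
         ("brutus", 4), ("calpurnia", 2), ("caesar", 1)]
print(bsbi(pairs, block_size=2))
```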
16. .
Inverted index construction
Data flow
Index Construction
[Figure: MapReduce data flow. Map phase: the master assigns splits to parsers, and each parser writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: the master assigns each partition (a-f, g-p, q-z) to an inverter, which produces the postings.]
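The data flow can be imitated in a single process. This sketch assumes lowercase whitespace tokenization and terms that start with a letter a-z; the functions `parse`, `invert` and `build_index` are hypothetical names, and the sequential loops stand in for work that a master would hand out to parser and inverter machines.

```python
from collections import defaultdict

# Term partitions by first letter, as in the figure.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    for k, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return k
    raise ValueError(f"no partition for term {term!r}")


def parse(split):
    """Map phase: turn one split of (doc_id, text) into per-partition segment files."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: sort the pairs of one partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


def build_index(splits):
    segment_files = [parse(split) for split in splits]          # parsers
    index = {}
    for k in range(len(PARTITIONS)):                            # inverters
        pairs = [p for seg in segment_files for p in seg.get(k, [])]
        index.update(invert(pairs))
    return index


splits = [[(1, "caesar came")], [(2, "brutus killed caesar")]]
print(build_index(splits))
```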
17. .
Search structures for Dictionary
A naïve dictionary
Hash tables
Trees
Binary tree
B-Tree
18. .
Index compression
• Dictionary compression for Boolean indexes
– Array of fixed-width entries (wasteful)
– Dictionary as a string
– Blocking
– Front coding
• Postings compression
– Gap encoding using prefix-unique codes
– Variable-byte codes
– Gamma codes (seldom used in practice)
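A sketch of gap encoding with variable-byte codes. It assumes the usual convention in which each byte carries 7 payload bits and the high bit marks the last byte of a number; the docIDs in the usage lines are illustrative values, not data from the slides.

```python
def to_gaps(doc_ids):
    """Replace each docID (after the first) by its difference to the previous one."""
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vb_encode_number(n):
    """Encode one non-negative integer; the final byte has its high bit set."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128
    return bytes(chunks)


def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte               # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


doc_ids = [824, 829, 215406]
encoded = vb_encode(to_gaps(doc_ids))
print(len(encoded), from_gaps(vb_decode(encoded)))
```

Small gaps fit in a single byte, which is why encoding gaps rather than raw docIDs pays off for frequent terms.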
19. .
Dictionary compression for Boolean indexes
Dictionary-as-a-String
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Table columns: Freq. | Postings ptr. | Term ptr. (Freq. values shown: 33, 29, 44, 126)
Total string length = 400K x 8B = 3.2MB
Pointers resolve 3.2M positions: log2(3.2M) = 22 bits = 3 bytes
20. .
Dictionary compression for Boolean indexes
Blocking
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Table columns: Freq. | Postings ptr. | Term ptr. (Freq. values shown: 33, 29, 44, 126)
Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.
21. .
Dictionary compression for Boolean indexes
Front coding
– Sorted words commonly have a long common prefix: store differences only (for the last k-1 terms in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(“8automat*” encodes automat; each later number is the extra length beyond automat.)
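A small sketch that reproduces the front-coded string shown above for one block. It assumes the notation of the slide (`*` ends the shared prefix, `◊` introduces each remaining suffix); the function name `front_code` is hypothetical.

```python
import os


def front_code(block):
    """Front-code one block of sorted terms using the slide's notation."""
    prefix = os.path.commonprefix(block)      # longest shared leading string
    first = block[0]
    # First term: its full length, the shared prefix, '*', then its own suffix.
    parts = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    # Remaining terms: length of the suffix beyond the prefix, '◊', the suffix.
    for term in block[1:]:
        suffix = term[len(prefix):]
        parts.append(f"{len(suffix)}◊{suffix}")
    return "".join(parts)


block = ["automata", "automate", "automatic", "automation"]
print(front_code(block))
```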
22. .
Information Retrieval
Ranked Retrieval
23. .
Information Retrieval
Ranked retrieval
• Thus far, our queries have all been Boolean.
– Good for expert users
– Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
– Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work).
– Most users don’t want to wade through 1000s of results.
– This is particularly true of web search.
24. .
Term Weighting
• Term frequency (tf) and inverse document frequency (idf)
– tf weight: the log-frequency weight of term t in document d
\[ w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases} \]
– idf: with df_t the number of docs in the collection that contain term t, and N the number of docs in the collection
\[ \mathrm{idf}_t = \log_{10} (N / \mathrm{df}_t) \]
• tf-idf weighting
– The tf-idf weight of a term is the product of its tf weight and its idf weight
\[ w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} (N / \mathrm{df}_t) \]
• tf-idf is the best-known weighting scheme in information retrieval
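The weighting formulas translate directly into code. The sketch below follows the definitions on this slide; the numbers in the usage line (a term occurring 10 times in a document, appearing in 1,000 of 1,000,000 docs) are made-up illustrative values.

```python
import math


def tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if the term occurs, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)


def tf_idf(tf, df, n_docs):
    """tf-idf weight: product of the tf weight and the idf weight."""
    return tf_weight(tf) * idf(n_docs, df)


# Illustrative values: tf = 10, df = 1,000, N = 1,000,000
print(tf_idf(tf=10, df=1_000, n_docs=1_000_000))
```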
25. .
Vector space model for scoring
– Represent the query as a weighted tf-idf vector
– Represent each document as a weighted tf-idf vector
– Compute the cosine similarity score for the query vector and each document vector
\[ \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}} \]
– Rank documents with respect to the query by score
– The tf-idf weight increases with the number of occurrences of the term within a document
– The tf-idf weight increases with the rarity of the term in the collection
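A minimal sketch of cosine scoring and ranking over dense vectors. The vectors in the usage lines are toy values rather than real tf-idf weights, and a zero vector is given score 0 by assumption to avoid dividing by zero.

```python
import math


def cosine(q, d):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document names ordered by decreasing cosine score."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


docs = {"d1": [2.0, 2.0], "d2": [1.0, 0.0], "d3": [0.0, 0.0]}
print(rank([1.0, 1.0], docs))
```

Because the score is length-normalized, d1 scores the same as any other vector pointing in the query's direction, regardless of its magnitude.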
26. .
Heuristic methods for speeding up vector space scoring & ranking
– Many of these heuristics achieve their speed at the risk of not finding exactly the top K documents matching the query
• Efficient scoring & ranking
1. Inexact top K document retrieval
2. Index elimination
3. Champion lists
4. Static quality scores
• We want top-ranking documents to be both relevant and authoritative
• Relevance is being modeled by cosine scores
• Authority is typically a query-independent property of a document
• Assign a query-independent quality score in [0,1] to each document d; denote this by g(d)
27. .
Heuristic methods for speeding up vector space scoring & ranking (cont.)
5. Cluster pruning: preprocessing
• Pick √N docs at random: call these leaders
• For every other doc, pre-compute its nearest leader
– Docs attached to a leader are its followers
– Likely: each leader has ~ √N followers
• Process a query as follows:
– Given query Q, find its nearest leader L
– Seek the K nearest docs from among L’s followers
Net score for a document d (static quality scores)
• The net score can be computed as a combination of cosine relevance and authority, e.g. net-score(q,d) = g(d) + cosine(q,d)
• Top K by net score: fast methods
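A sketch of cluster pruning under simple assumptions: documents are small dense vectors, nearness is cosine similarity, and each leader is counted among its own followers. The function names are hypothetical. Leaders are passed in explicitly in the usage lines so the result is reproducible; `pick_leaders` shows the random choice of about √N leaders.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_leaders(n_docs, rng):
    """Preprocessing: pick about sqrt(N) docs at random as leaders."""
    return rng.sample(range(n_docs), max(1, round(math.sqrt(n_docs))))


def attach_followers(docs, leaders):
    """Preprocessing: attach every other doc to its nearest leader."""
    followers = {leader: [leader] for leader in leaders}
    for i, doc in enumerate(docs):
        if i not in followers:
            nearest = max(leaders, key=lambda l: cosine(doc, docs[l]))
            followers[nearest].append(i)
    return followers


def search(query, docs, followers, k):
    """Query time: find the nearest leader, then rank only its followers."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    candidates = followers[leader]
    return sorted(candidates, key=lambda i: cosine(query, docs[i]), reverse=True)[:k]


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
followers = attach_followers(docs, leaders=[0, 2])
print(search([1.0, 0.2], docs, followers, k=2))
```

Only the followers of one leader are scored, so a true top-K document attached to a different leader can be missed; that is the inexactness the heuristic accepts.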
28. .
Cluster Pruning
[Figure: a query, the leaders, and each leader's followers]
29. .
Parametric and zone indexes
• In fact documents have multiple parts, some with
special semantics:
– Author, Title, Date of publication, Language, Format, etc.
• These constitute the metadata about a document
• We sometimes wish to search by these metadata
• Field or parametric index: postings for each field value
– Field query typically treated as conjunction
• A zone is a region of the doc that can contain an
arbitrary amount of text e.g., Title, Abstract,
References …
– Build inverted indexes on zones as well to permit
querying
30. .
Example zone indexes
Encode zones in dictionary vs. postings.
31. .
Tiered indexes
• Break postings up into a hierarchy of lists
– Most important
– …
– Least important
• Can be done by g(d) or another measure
• Inverted index thus broken up into tiers of decreasing importance
• At query time use the top tier unless it fails to yield K docs
– If so, drop to lower tiers
32. .
Example tiered index
33. .
A Complete Search System
34. .
Evaluating a Search Engine (Ranked Retrieval Method)
35. .
Measures for a search engine
Which properties of a search engine (SE) matter?
– How fast does it index?
– How fast does it search?
– Expressiveness of the query language
– Uncluttered user interface (UI)
– Is it free?
36. .
The key measure
User happiness
• Useless answers won’t make a user happy
• Need a way of quantifying user happiness
• Issue: who is the user we are trying to make happy?
– Web engine
– eCommerce site
– Enterprise (company/govt/academic)
• Happiness: elusive to measure
37. .
Evaluation of unranked retrieval
– Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
– Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
 | Relevant | Nonrelevant
Retrieved | tp | fp
Not retrieved | fn | tn
• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)
38. .
Evaluation of unranked retrieval (Cont.)
• What about accuracy?
– The accuracy of an engine: the fraction of classifications that are correct, (tp + tn)/(tp + fp + fn + tn)
– Accuracy is a commonly used measure in machine learning classification work
– Why is this not a very useful evaluation measure in IR?
– How to build a 99.9999% accurate search engine on a low budget: return nothing, since almost all docs are nonrelevant to any given query
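The point about accuracy can be checked with a few lines. The counts below are hypothetical: a collection of 1,000,000 docs with a single relevant one, and an engine that returns nothing. Precision over an empty result set is reported as 0 by assumption.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical engine that retrieves nothing: the one relevant doc is missed (fn),
# and every nonrelevant doc is "correctly" not retrieved (tn).
tp, fp, fn, tn = 0, 0, 1, 999_999

print(accuracy(tp, fp, fn, tn))   # very high, although the engine is useless
print(recall(tp, fn))             # recall exposes the problem
```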
39. .
Evaluation of unranked retrieval (Cont.)
• F measure
– A combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):
\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \]
– People usually use the balanced F1 measure, i.e., with β = 1 or α = ½
– For F1 the best value is 1 and the worst value is 0
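A short sketch of the F measure in its β form. Returning 0 when both precision and recall are 0 is an assumption made here to avoid a division by zero.

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r; beta=1 gives F1."""
    b2 = beta * beta
    denominator = b2 * p + r
    return (b2 + 1) * p * r / denominator if denominator else 0.0


print(f_measure(0.5, 0.5))             # balanced F1
print(f_measure(0.9, 0.1))             # harmonic mean punishes the low recall
print(f_measure(0.9, 0.1, beta=2.0))   # beta > 1 weights recall more heavily
```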
40. .
Evaluation of Ranked Retrieval
• By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve
• We can determine a value between the points using interpolation
• 11-point interpolated average precision
• Other methods: mean average precision (MAP) and R-precision
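A sketch of how the curve and the 11-point figure could be computed for one query. It assumes the usual definition of interpolated precision at recall level r as the highest precision found at any recall ≥ r; the ranking and relevance judgments are toy data.

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) after each of the top-k returned documents."""
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def interpolated_precision(points, r):
    """Highest precision at any recall level >= r (0 if none is reached)."""
    return max((p for rec, p in points if rec >= r), default=0.0)


def eleven_point_average(points):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
points = precision_recall_points(ranking, relevant)
print(points)
print(eleven_point_average(points))
```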
41. .
A precision-recall curve
[Figure: precision-recall curve; x-axis: Recall (0.0 to 1.0), y-axis: Precision (0.0 to 1.0)]
42. .
Typical (good) 11 point precisions
• SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point precision plot; x-axis: Recall (0 to 1), y-axis: Precision (0 to 1)]
43. .
Relevance Feedback (RF) for Query Refinement in a Search Engine
44. .
Relevance Feedback
• User feedback on the relevance of docs in an initial set of results
– The user issues a (short, simple) query
– The user marks some results as relevant or non-relevant
– The system computes a better representation of the information need based on the feedback
– Relevance feedback can go through one or more iterations
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate
45. .
Relevance Feedback: Example
• Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
46. .
Results for Initial Query
47. .
Relevance Feedback
49. .
Key concept: Centroid
• The centroid is the center of mass of a set of
points
• Recall that we represent documents as points in
a high-dimensional space
• Definition: Centroid
\[ \vec{\mu}(C) = \frac{1}{|C|} \sum_{\vec{d} \in C} \vec{d} \]
where C is a set of documents.
50. .
Rocchio Algorithm
• The Rocchio algorithm uses the vector space model to pick a relevance feedback query
• Rocchio seeks the query q_opt that maximizes
\[ \vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \cos(\vec{q}, \vec{\mu}(C_r)) - \cos(\vec{q}, \vec{\mu}(C_{nr})) \right] \]
• Tries to separate docs marked relevant and non-relevant
\[ \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j \]
• Problem: we don’t know the truly relevant docs
51. .
Rocchio 1971 Algorithm (SMART)
• Used in practice:
\[ \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \]
• Dr = set of known relevant doc vectors
• Dnr = set of known irrelevant doc vectors
– Different from Cr and Cnr!
• qm = modified query vector; q0 = original query vector; α, β, γ: weights (hand-chosen or set empirically)
• New query moves toward relevant documents and away from irrelevant documents
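The update rule can be sketched with plain lists as vectors. The default weights (α = 1, β = 0.75, γ = 0.15) are assumed illustrative values, not values given in the slides, and an empty set of marked documents is treated as contributing a zero centroid.

```python
def centroid(vectors):
    """Center of mass of a non-empty set of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query: alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    dims = len(q0)
    mu_r = centroid(relevant) if relevant else [0.0] * dims
    mu_nr = centroid(nonrelevant) if nonrelevant else [0.0] * dims
    return [alpha * q + beta * r - gamma * nr
            for q, r, nr in zip(q0, mu_r, mu_nr)]


q0 = [1.0, 0.0]
known_relevant = [[0.0, 1.0], [0.0, 3.0]]
known_nonrelevant = [[2.0, 0.0]]
print(rocchio(q0, known_relevant, known_nonrelevant, alpha=1.0, beta=0.5, gamma=0.25))
```

In the usage lines the second component grows (toward the relevant centroid) and the first shrinks (away from the non-relevant centroid).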
52. .
The Theoretically Best Query
[Figure: the optimal query (∆) plotted among relevant documents (o) and non-relevant documents (x)]
53. .
Relevance feedback on initial query
[Figure: the initial query (∆) and the revised query (∆) plotted among known relevant documents (o) and known non-relevant documents (x)]
54. .
Relevance Feedback in vector spaces
• We can modify the query based on relevance
feedback and apply standard vector space model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and
precision
• Relevance feedback is most useful for increasing
recall in situations where recall is important
– Users can be expected to review results and to
take time to iterate
55. .
Relevance feedback revisited
• In relevance feedback, the user marks a number of
documents as relevant/nonrelevant.
• We then try to use this information to return better
search results.
• Suppose we just tried to learn a filter for nonrelevant
documents
• This is an instance of a text classification problem:
– Two “classes”: relevant, nonrelevant
– For each document, decide whether it is relevant or
nonrelevant
56. .
Text Classification
57. .
Classification Methods #1
Manual classification
• Used by Yahoo! (originally; now present but
downplayed), Looksmart, about.com, ODP,
PubMed
• Very accurate when job is done by experts
• Consistent when the problem size and team are small
• Difficult and expensive to scale
58. .
Classification Methods #2
Automatic document classification
• Hand-coded rule-based systems
– One technique used by CS dept’s spam filter,
Reuters, CIA, etc.
– Companies (Verity) provide “IDE” for writing such
rules
– Accuracy is often very high if a rule has been carefully
refined over time by a subject expert
– Building and maintaining these rules is expensive
59. .
Classification Methods #3
Supervised learning
• Supervised learning of a document-label
assignment function
– Many systems partly rely on machine learning
• k-Nearest Neighbors (simple, powerful)
• Naive Bayes (simple, common method)
• Support-vector machines (new, more powerful)
• No free lunch: requires hand-classified training data
• But data can be built up (and refined) by amateurs
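As an illustration of the simplest method in the list, here is a minimal k-Nearest Neighbors sketch. It assumes documents are already represented as small dense vectors, uses cosine similarity as the nearness measure, and takes a majority vote among the k nearest hand-classified training examples; the training data and labels are toy values.

```python
import math
from collections import Counter


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(doc, training, k=3):
    """Label a doc by majority vote among its k most similar training docs."""
    nearest = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hand-classified training data (toy vectors).
training = [
    ([1.0, 0.0], "relevant"),
    ([0.9, 0.1], "relevant"),
    ([0.0, 1.0], "nonrelevant"),
    ([0.1, 0.9], "nonrelevant"),
]
print(knn_classify([1.0, 0.2], training, k=3))
```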
60. .
References
• Introduction to Information Retrieval, 2008
• Managing Gigabytes, 1999
Editor's Notes
SMART: Cornell (Salton) IR system of 1970s to 1990s.
Just as we modified the query in the vector space model, we can also modify it here. I’m not aware of work that uses language-model-based IR this way.