An introduction to system-oriented evaluation
in Information Retrieval
Mounia Lalmas
Outline
o  What to evaluate in IR
o  Test collection methodology
-  Document, information need, query, relevance
-  TREC
o  Precision and recall
-  Average precision, interpolated, mean average precision (MAP)
-  P@r, R-Precision, MRR
-  E and F measures
o  Other measures (DCG, bpref)
o  Significance testing
o  Large-scale evaluation (web search & clicks)
o  Evaluating classifiers
2
Information Retrieval = IR
IR vs. Search
Evaluation in general versus evaluation in IR
o  Evaluating a system in computer science is often concerned with
time and space → system performance
o  With large collections of documents, system performance is still very
important
o  However, in IR, we care a lot about retrieval performance: are the
retrieved documents “relevant” to a “user information need”?
4
Why do we need to evaluate an IR system?
o  The user wants to find recipes about
“couscous” as cooked in various
countries
o  User uses 2 IR systems
o  How can we say which one is better?
5
Acknowledgements
6
These slides were based on
- Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas
- Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan
- Retrieval Evaluation @ University of Virginia; Hongning Wang
- Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson
- Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles
o  Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979)
o  Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008)
o  Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edition; R.
Baeza-Yates & B. Ribeiro-Neto (2011)
What to evaluate in IR
o  coverage of the collection: extent to which the system includes
relevant material
o  time lag (efficiency): average interval between the time a query is
submitted and the answer is given
o  presentation of the output
o  effort involved by user in obtaining answers to a query
o  recall of the system: proportion of relevant documents retrieved
o  precision of the system: proportion of the retrieved documents that
are actually relevant
7
o  coverage has to do with the quality of the collection
o  efficiency in terms of speed, memory usage, etc
o  presentation has to do with interface and visualisation
issues
o  effort has to do with user issues, e.g. user satisfaction.
o  recall and precision have to do with retrieval effectiveness
or effectiveness for short → system-oriented evaluation
8
What to evaluate in IR
System-oriented evaluation
o  Measuring effectiveness has been the predominant focus of IR evaluation
o  Test collection methodology
- Benchmark (dataset) upon which effectiveness is measured and compared
- Dataset tells for a given query what are the relevant documents
o  Metrics to measure effectiveness
- Precision and recall, and variants
- E and F measures
- Others (DCG, bpref)
9
Test Collection methodology
o  Compare retrieval performance using a test collection
- Document collection, that is, the documents themselves. The document collection
depends on the task, e.g. evaluating web retrieval requires a collection of HTML
documents.
- Queries, which simulate real user information needs.
- Relevance judgements, stating for a query the relevant documents.
o  To compare the performance of two techniques:
- each technique used to answer queries
- results (set or ranked list) compared using some effectiveness performance measure
- most common measures are precision and recall
o  Usually use multiple measures to get different views of performance
o  Usually test with multiple collections as performance can be collection
dependent
10
Information need, query and relevance
o  The information need is translated into a query
o  Relevance is assessed relative to the information need not the query
- Information need: I am looking for information on what are the best places to go on
holiday near the beach and play tennis
- Query: tennis beach holiday
- Evaluate whether the document addresses the information need, not whether it has the
three words “tennis”, “beach” and “holiday”
Sec. 8.1
11
Relevance … as defined in system-oriented
evaluation
o  A document is relevant if it “has significant and demonstrable bearing
on the matter at hand”.
o  There are common assumptions about the nature of relevance in
system-centred evaluation:
- Objectivity: everybody agrees on whether a document is relevant or not to a
query
- Topicality: relevance is about whether the document is about the topic
expressed in the query
- Binary nature: either a document is relevant or not
- Independence: the fact that a document is relevant to a query has no effect
on the relevance of another document for that same query
12
Relevance is difficult to define satisfactorily
o  A document is relevant within the context of a query
- Who judges the relevance? → humans are not very consistent (see next slide)
- Is the document useful? → Utility
- Judgment on whether a document is relevant or not depends on more than the document
and the query
o  With real collections, we never know the full set of relevant documents
o  Retrieval model incorporates notion of relevance
- Satisfiability of a logical expression in Boolean model
- P(relevance | query, document) in BIRM
- Similarity to query in VSM
- P(query generated | document model) in LM
13
Kappa measure for inter-judge relevance
agreement
o  Kappa measure
- Agreement measure among judges (assessing document
relevance)
- Designed for categorical judgments (relevant or not)
o  Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
o  P(A) – proportion of time judges agree
o  P(E) – what agreement would be by chance
o  Kappa = 0 for chance agreement, 1 for total agreement
Sec. 8.5
14
Kappa Measure: Example
Number of documents assessed   Judge 1        Judge 2
300                            Relevant       Relevant        (judges agree)
70                             Non-relevant   Non-relevant    (judges agree)
20                             Relevant       Non-relevant    (judges disagree)
10                             Non-relevant   Relevant        (judges disagree)
Sec. 8.5
15
Kappa measure: Example
P(A) = 370/400 = 0.925
P(non-relevant) = (10+20+70+70)/800 = 0.2125
P(relevant) = (10+20+300+300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776

Kappa > 0.8 → good agreement
0.67 < Kappa < 0.8 → "tentative conclusions"

For > 2 judges → average pairwise kappas
Sec. 8.5
16
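As a concrete illustration, here is a minimal Python sketch of the kappa computation from a 2x2 agreement table; the function name is mine and the counts are those of the worked example above.

```python
def kappa(rel_rel, non_non, rel_non, non_rel):
    """Cohen's kappa for two judges making binary relevance judgments.

    rel_rel: both say relevant; non_non: both say non-relevant;
    rel_non: judge 1 says relevant, judge 2 non-relevant; non_rel: the reverse.
    """
    total = rel_rel + non_non + rel_non + non_rel
    p_agree = (rel_rel + non_non) / total                         # P(A)
    # Pooled marginals over the 2 * total individual judgments
    p_relevant = (2 * rel_rel + rel_non + non_rel) / (2 * total)
    p_nonrelevant = (2 * non_non + rel_non + non_rel) / (2 * total)
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2               # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(300, 70, 20, 10))  # ~0.776, as in the example above
```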
Impact of inter-judge agreement on IR systems
comparisons
o  Impact on absolute effectiveness performance measures can be
significant (0.32 vs 0.39)
o  But little impact on ranking of different systems or relative
effectiveness performance
o  If we just want to know if IR system A is better than IR system B
→ test collection methodology gives a reliable comparison
Sec. 8.5
17
Find the relevant documents in the collection
o  Did the IR system find all relevant documents?
o  To answer accurately, we need complete judgments
- i.e., “yes,” “no,” or some score for every query-document pair
o  For small test collections, we can review all documents for all queries
o  Not practical for large or even medium-sized collections
- TREC collections have millions of documents
o  Pooling method
o  Click-based evaluation in web search (later in the lecture)
18
Test collection creation
o  Manual method:
- Every document in the collection is judged against every query by one of several judges
(human assessors)
- This is feasible for small document collections.
o  Pooling method (used for large document collections):
- The queries are run against several IR systems first
- The top, for example 100, documents retrieved by each system are pooled together
- The pool is then judged for relevance (by human assessors)
- This is what TREC does
o  Query logs (web search) → see later about "evaluation with clicks"
19
Sample test collections (ad hoc retrieval)
Characteristics                  Cranfield   CACM     ISI     West       TREC2
Collection size (docs)           1400        3204     1460    11953      742611
Collection size (MB)             1.5         2.3      2.2     254        2162
Year created                     1968        1983     1983    1990       1991
Unique stems                     8226        5493     5448    196707     1040415
Stem occurrences                 123200      117578   98304   21798833   243800000
Max within-document frequency    27          27       1309
Mean document length (words)     88          36.5     67.3    1823       328
Number of queries                225         50       35      44         100
20
ad hoc retrieval: query, document, ranking
CIS
o  1239 documents about cystic fibrosis from MEDLINE collection
o  Fields: author, title, source, major and minor subjects, abstracts, references and
citations
o  100 queries, developed by relevance judges
o  Unusual features:
-  4 judges per document per query (3 experts,
1 medical bibliographer)
-  3 levels of relevance (0-2)
-  Combined relevance on scale of 0-8
(Table on the slide: examples of individual judge scores and the resulting combined relevance values)
21
Added so we do not forget history
CACM
o  3204 articles on computer science from CACM, 1958 - 1979
o  Fields: author, date, word stems for titles and abstracts, categories, direct
referencing, bibliography coupling, number of co-citations for each pair of articles
o  52 queries, each with 2 Boolean formulations
o  Unusual features:
- Citation links to other documents, so often used for hypertext-type
experiments
22
Added so we do not forget history
TREC
o  Text REtrieval Conference/Competition
- http://trec.nist.gov
- Run by NIST (National Institute of Standards & Technology)
o  Collections: > Terabytes
o  Datasets
- Newswire & full-text news (AP, WSJ, Ziff, FT)
- Government documents (Federal Register, Congressional Record)
- Radio transcripts (FBIS)
- Web "subsets"
- …
23
Tracks
change from
year to year
24
Queries & relevance judgments at TREC
o  Queries devised and judged by “information
specialists" → TREC Topics
o  Relevance judgments done only for those documents
retrieved and not entire collection!
- E.g. merge top 100 retrieved documents from systems experimented
with (TREC participants)
- Pooling method
25
Example (excerpt) of a TREC document
<doc>
<docno> WSJ880406-0090 </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks
Under Global Plan </hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the rest of a new generation of phone
services with broad ...
</text>
</doc>
26
Example (excerpt) of a TREC topic
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in financing the operation of
the National Railroad Transportation Corporation (AMTRAK)
<nar> Narrative:
A relevant document must provide information on the government's responsibility to make
AMTRAK an economically viable entity. It could also discuss the privatisation of AMTRAK
as an alternative to continuing government subsidies. Documents comparing government
subsidies given to air and bus transportation with those provided to AMTRAK would also be
relevant.
</top>
27
TREC legacy
o  Pros:
- made research systems scale to large collections (pre-WWW)
- allows for controlled comparisons
o  Cons:
- emphasis on high recall, often unrealistic for what most users want → but
recall-oriented search exists (patent retrieval, e-discovery)
- very long queries, unrealistic → systems optimized for long queries and
hence perform worse for shorter, more realistic queries
- focus on batch ranking (one-off result) rather than interaction (but session track
was introduced to evaluate a “user search session”)
28
Other evaluation forums
o  CLEF (Cross-Language Evaluation Forum)
o  NTCIR (NII Testbeds and Community for Information access Research)
o  FIRE (Forum for Information Retrieval Evaluation)
o  INEX (The Initiative for the Evaluation of XML retrieval)
29
Effectiveness
o  We recall that the goal of an IR system is to retrieve as
many relevant documents as possible and as few non-
relevant documents as possible.
o  Evaluating the above consists of a comparative evaluation
of technical performance of IR system(s):
- In traditional IR, technical performance means the effectiveness of the IR
system: the ability of the IR system to retrieve relevant documents and
suppress non-relevant documents
- Effectiveness is measured by the combination of recall and precision
30
Intuition behind precision and recall
o  Collection of 10,000 documents, 50 relevant to a given topic
o  Ideal system finds these 50 documents and rejects all others
o  An actual system likely identifies 25 documents; 20 are relevant
and 5 are on other topics
Precision: 20/25 = 0.8 (80% of retrieved documents are relevant)
Recall: 20/50 = 0.4 (40% of the relevant documents are found)
31
Measuring Precision and Recall
Precision is easy to measure:
o  Look at each document retrieved and decide whether it is relevant or not
o  In previous example, only the 25 documents that are found need to be
examined
Recall is difficult to measure:
o  To know all relevant items, we must go through entire collection, looking
at every document to decide if it is relevant or not
o  In previous example, all 10,000 documents must be examined! → remember
the pooling method at TREC
32
Recall / Precision
(Venn diagram: the document collection with the retrieved set, the relevant set, and the documents that are both retrieved and relevant)
Knowing which documents are relevant to which queries comes from the test collection.
For a given query, the document collection can be divided into three sets:
the set of retrieved documents, the set of relevant documents,
and the rest of the documents.
33
Recall / Precision
In the ideal case, the set of retrieved documents is equal to the set of relevant
documents. However, in most cases, the two sets will be different.
This difference is formally measured with precision and recall.
34
(Venn diagram: the document collection with the retrieved set, the relevant set, and the documents that are both retrieved and relevant)
precision = number of relevant documents retrieved / number of documents retrieved
recall = number of relevant documents retrieved / number of relevant documents
Retrieved vs. Relevant Documents
Very high precision, very low recall
(Diagram: retrieved set vs. relevant set)
35
A high precision rate is achieved by returning documents that we know for sure
are relevant → Is this a good idea?
Retrieved vs. Relevant Documents
High recall, but low precision
(Diagram: retrieved set vs. relevant set)
36
100% recall can be achieved by returning all documents in the collection
→ This is for sure a bad idea!
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 for both)
(Diagram: retrieved set vs. relevant set)
37
Total failure!
Retrieved vs. Relevant Documents
High precision, high recall
(Diagram: retrieved set vs. relevant set)
38
The perfect scenario!
Recall and Precision
The above two measures do not take into account where the relevant documents
are retrieved, that is, at which rank (crucial since the output of most IR systems
is a ranked list of documents).
This is very important because an effective IR system should not only retrieve
as many relevant documents as possible and as few non-relevant documents as
possible, but also it should retrieve relevant documents before the non-relevant
ones.
39
precision = number of relevant documents retrieved / number of documents retrieved
recall = number of relevant documents retrieved / number of relevant documents
Recall and Precision
o  Let us assume that for a given query, the following documents are relevant (10 relevant
documents)
{d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
o  Now suppose that the following documents are retrieved for that query:
For each relevant document (marked with * below), we calculate the precision value and the recall value. For
example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the precision
is 2/3. We have so far retrieved 2 of the relevant documents (the total number of relevant documents
being 10), so recall is 2/10.
rank   doc     precision   recall        rank   doc     precision   recall
1      d123*   1/1         1/10          8      d129
2      d84                               9      d187
3      d56*    2/3         2/10          10     d25*    4/10        4/10
4      d6                                11     d48
5      d8                                12     d250
6      d9*     3/6         3/10          13     d113
7      d511                              14     d3*     5/14        5/10

(* = relevant document)
40
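The table above can be reproduced with a short Python sketch (the function name and the way documents are represented are my own choices, not from the slides):

```python
def recall_precision_points(ranking, relevant):
    """Return a (recall, precision) pair at the rank of each retrieved relevant document."""
    points, rel_seen = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            rel_seen += 1
            points.append((rel_seen / len(relevant),  # recall
                           rel_seen / rank))          # precision
    return points

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511",
           "d129", "d187", "d25", "d48", "d250", "d113", "d3"]
print(recall_precision_points(ranking, relevant))
# ~ [(0.1, 1.0), (0.2, 0.67), (0.3, 0.5), (0.4, 0.4), (0.5, 0.36)]
```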
Recall and Precision
o  For each query, we obtain pairs of recall and precision values
- In our example, we would obtain
(1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) …
which are usually expressed in %
(10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%), (50%, 35.71%) …
- This can be read for instance: at 20% recall, we have 66.66% precision; at 50%
recall, we have 35.71% precision
The pairs of values are plotted into a graph, giving the following curve
(Plot: precision (%) on the y-axis against recall (%) on the x-axis, both from 10 to 100)
41
Recall and Precision
o  We have shown how to derive the recall and precision curve for a
given query
o  Now we describe how, using the above for all queries, the
effectiveness of an IR system is evaluated and thus compared to
other IR systems.
o  Note that we can also compare the same system, but with different
versions (e.g. different parameters are used). The idea here is to
find out the best version of the IR system.
42
The complete methodology
For each IR system / IR system version
1.  For each query in the test collection
a.  We first run the query against the system to obtain a ranked list of retrieved
documents
b.  We use the ranking and relevance judgements to calculate recall/precision pairs
2.  Then we average recall / precision values across all queries, to
obtain an overall measure of the effectiveness
43
Averaging across queries
o  Hard to compare precision and recall graphs or tables for
individual queries (too much data)
- Need to average over many queries
o  Two main types of averaging
- Macro-average: each query is a point in the average
- Micro-average: each relevant document is a point in the average
- Macro is mostly used (all queries count equally)
44
(Macro) Interpolated average precision
o  Average precision at standard recall points
o  For a given query, compute precision and recall point for every relevant
document
o  Interpolate precision at standard recall levels
- 11-pt is usually 100%, 90%, 80%, ..., 10%, 0%
o  Average over all queries to get average precision at each recall level
45
Interpolation
(Plot: precision (%) against recall (%), showing observed values and interpolated values)
It is often the case that the observed (recall, precision) points do not fall on the standard
recall values (10%, 20%, …). We therefore interpolate to obtain precision at the standard recall levels.
For example, an observed value at recall 25% is interpolated to the nearest standard
recall value on the right, that is, 30%.
46
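A minimal sketch of 11-point interpolation (my own code). It uses the convention, common in trec_eval-style evaluation, that the interpolated precision at a standard recall level is the maximum precision observed at any recall greater than or equal to that level; this is one way of realising the interpolation sketched above.

```python
def interpolated_11_point(points):
    """points: (recall, precision) pairs for one query, e.g. from recall_precision_points().
    Returns (recall level, interpolated precision) at recall 0.0, 0.1, ..., 1.0,
    taking at each level the maximum precision at any recall >= that level."""
    levels = [i / 10 for i in range(11)]
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append((level, max(candidates) if candidates else 0.0))
    return result

points = [(0.1, 1.0), (0.2, 2 / 3), (0.3, 0.5), (0.4, 0.4), (0.5, 5 / 14)]
print(interpolated_11_point(points))
```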
Interpolated average precision
(Plot: precision (%) against recall (%) at the standard recall levels for query 1, query 2, and their average)
We have precision values at standard recall values for two queries. The
precision values for query 1 are higher than those for query 2. This means that the
effectiveness of the IR system is better for query 1 than for query 2. We can plot
the average of the two queries.
47
Averaging
The same information
can be displayed in
a table.
48
Precision in %
Recall in %   Query 1   Query 2   Average
10            80        60        70
20            80        50        65
30            60        40        50
40            60        30        45
50            40        25        32.5
60            40        20        30
70            30        15        22.5
80            30        10        20
90            20        5         12.5
100           20        5         12.5
Comparison of systems
(Plot: precision (%) against recall (%) for system 1 and system 2)
We can now compare IR systems / system versions. For example, here we see that at low
recall, system 2 is better than system 1, but this changes from recall value 30%, etc. It is
common to calculate an average precision value across all recall levels, so as to have a
single value to compare.
49
Averaging across averages
o  Average interpolated recall levels to get single result
- Called “interpolated average precision”
-  Not used much anymore; “mean average precision” more common
-  Values at specific interpolated points still commonly used
o  Mean average precision (MAP)
- (“Average average precision” sounds weird)
- Average precision over all relevant documents, non-interpolated
- Rewards systems that retrieve relevant documents quickly (highly ranked)
50
Mean Average Precision
Consider the rank position of each of the n relevant documents for a given query:
r1, r2, … rn
Compute precision@r (denoted P@r) at each of r1, r2, … rn
Average precision = average of P@r for the given query
MAP is Average Precision across multiple queries
Example: the relevant documents are retrieved at ranks 1, 3 and 5, and these are all the
relevant documents for the query:
(1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76
51
Mean Average Precision (MAP)
52
average precision query 1 (AP) = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62
average precision query 2 (AP) = (0.5 + 0.4 + 0.43)/3 = 0.44
mean average precision (MAP) = (0.62 + 0.44)/2 = 0.53
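A minimal sketch of non-interpolated average precision and MAP as just described (function names are mine); relevant documents that are never retrieved simply contribute zero to the average.

```python
def average_precision(ranking, relevant):
    """Average of P@rank over the ranks of the retrieved relevant documents,
    divided by the total number of relevant documents (missed ones count as 0)."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: a list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant documents at ranks 1, 3 and 5, with 3 relevant documents in total:
print(average_precision(["r1", "n", "r2", "n", "r3"], {"r1", "r2", "r3"}))  # ~0.76
```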
More about mean average precision (MAP)
o  If a relevant document is not retrieved, precision corresponding to that
relevant document is zero
o  Most commonly used measure in research papers … with issues
o  Not so good for web search evaluation (precision oriented)
- MAP assumes user is interested in finding many relevant documents
53
TREC (trec_eval) evaluation results
Recall Level Precision Averages
Recall   Precision
0.0      0.61
0.1      0.45
…        …
1.0      0.003

Average precision over all relevant documents,
non-interpolated (MAP): 0.23
54
Average precision per query
(Bar plot: difference in average precision per topic, topic ids 200, 201, 202, 203, 204, …; y-axis from -1.0 to 1.0)
55
A system may perform badly for some information needs (MAP = 0.1) and excellently
on others (MAP = 0.7)
→ it is often the case that the variance in performance of the same system across queries is much greater
than the variance of different systems on the same query
There are easy information needs and hard ones!
Rank-based measures
o Binary relevance
- Mean Average Precision (MAP)
- P@r
- R-Precision
- Mean Reciprocal Rank (MRR)
- bpref
o Multiple levels of relevance
- Normalized Discounted Cumulative Gain (NDCG)
56
P@r or Precision @ rank r
Set a rank threshold r
Compute % relevant documents in top r
Ignores documents ranked lower than r
P@3 = 2/3
P@4 = 2/4
P@5 = 3/5
actual performance as a user
might see it
often used in web retrieval
used at fixed rank values:
P@5, P@10
57	
Note the slight difference with P@r in slide 51
R-Precision
o  Precision after R documents are retrieved
o  R = number of relevant documents for the topic
o  De-emphasize exact ranking of retrieved relevant documents, which can
be useful for topics with large number of relevant documents
o  Perfect system could score 1.0
o  Average R-precision
- Example: 2 topics, with 50 and 10 relevant documents respectively.
- Assume IR system return 17 relevant documents in the top 50 documents for
1st topic and 7 relevant documents in top 10 for 2nd topic
- Average R-precision for this IR system is (17/50 + 7/10) / 2 = 0.52
58
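Both P@r and R-precision are straightforward to compute; a small sketch under the same conventions as the earlier examples (names and document ids are mine):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant; lower ranks are ignored."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant):
    """Precision after R documents are retrieved, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(ranking, relevant, 3))  # 2/3
print(r_precision(ranking, relevant))        # R = 3, so also 2/3 here
```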
Mean Reciprocal Rank (MRR)
o Suppose there is only one relevant document
o Scenarios: known-item search, navigational queries, looking for a fact
o Search duration → rank of the answer
measures a user's effort in finding that one and only document
Consider the rank position, r, of the first relevant document
Reciprocal Rank score = 1/r
MRR is the mean RR across multiple queries
59
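A corresponding sketch for reciprocal rank and MRR (my own code, hypothetical document ids):

```python
def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document; 0 if no relevant document is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: a list of (ranking, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

print(mean_reciprocal_rank([(["x", "y", "a"], {"a"}),   # first relevant at rank 3 -> 1/3
                            (["b", "x"], {"b"})]))      # first relevant at rank 1 -> 1.0
# (1/3 + 1.0) / 2 ~ 0.67
```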
E-measure
o  Used to emphasize precision (or recall)
- Essentially a weighted average of precision and recall
- Large α increases importance of precision
o  Can be transformed by α = 1/(β2+1) leading to
- When β =1 (α=1/2) equal importance of precision and recall
- Normalised symmetric difference of retrieved and relevant sets
60
E = 1 - 1 / ( α·(1/P) + (1 - α)·(1/R) )

E = 1 - ( (β² + 1)·P·R ) / ( β²·P + R )
Symmetric Difference and E
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
A⊗B (the symmetric difference) is the shaded area of the Venn diagram of A and B
|A⊗B| = |A∪B| - |A∩B|
      = |A| + |B| - 2|A∩B|
E(β=1) = 1 - 2PR/(P+R) = (P+R-2PR)/(P+R)
       = …
       = |A⊗B| / (|A| + |B|)
i.e. the normalised symmetric difference
61
F measure
o  F = 1-E often used
- Good results mean larger values of F
- “F1” measure is popular: F with β=1
- particularly popular for evaluating classification approaches
harmonic mean
of P and R
62
F = 1 - E = ( (β² + 1)·P·R ) / ( β²·P + R )

F1 = 2PR / (P + R) = 1 / ( (1/2)·(1/R + 1/P) )

The harmonic mean is a conservative average
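The E and F measures translate directly into code; a minimal sketch (my own function names):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 is the harmonic mean of P and R."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    """E = 1 - F, as on the previous slides."""
    return 1.0 - f_measure(precision, recall, beta)

print(f_measure(0.8, 0.4))             # F1 ~ 0.53
print(f_measure(0.8, 0.4, beta=2.0))   # beta > 1 gives more weight to recall (~0.44)
```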
F measure, geometric interpretation
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
(Venn diagram of A, B and A∩B)
63
F(β=1) = 2PR/(P + R)
       = 2 · ( |A∩B|² / (|A|·|B|) ) / ( |A∩B|·(1/|A| + 1/|B|) )
       = 2|A∩B| / (|A| + |B|)
Relation to Contingency Table
Why is accuracy not much used in IR with large document collections?
- Most document are NOT relevant
- Most documents are NOT retrieved
- Inflates the accuracy value
                            Document is relevant   Document is NOT relevant
Document is retrieved                a                        b
Document is NOT retrieved            c                        d
64
Accuracy : (a + d)/(a + b + c + d)
Precision : a/(a + b)
Recall : a/(a + c)
Are all relevant documents "equally" relevant?
(Example results on the slide are labelled with graded relevance: fair, fair, Good, Excellent)
65
Discounted Cumulative Gain (DCG)
o  Popular measure for evaluating web search
o  Two assumptions:
- Highly relevant documents are more useful than marginally relevant
documents
- The lower the ranked position of a relevant document, the less useful it is for
the user, since it is less likely to be examined
66
Discounted Cumulative Gain (DCG)
o  Uses graded relevance as a measure of usefulness, or gain, from
examining a document
o  Gain is accumulated starting at the top of the ranking and can be
reduced, or discounted, at lower ranks
o  Typical discount is 1/log(rank)
- With base 2, the discount at rank 4 is 1/2 , and at rank 8 it is 1/3
67
Summarize a Ranking with DCG
o  Relevance judgments on a scale of [0, r] with r > 2
o  Cumulative Gain (CG) at rank n
- Let the ratings of the n documents be r1, r2, …rn (in ranked order)
- CG = r1+r2+…+rn
o  Discounted Cumulative Gain (DCG) at rank n
- DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)   (we may use any base for the logarithm)
68

DCG_n = rel_1 + Σ_{i=2}^{n} rel_i / log2(i)
DCG Example
o  10 ranked documents judged on 0-3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
o  discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
o  discounted cumulative gain (DCG):
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
69
(Plot: DCG against rank for this example)
Summarize a Ranking with NDCG
o  Normalized Discounted Cumulative Gain (NDCG) at rank n
- Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
- Ideal ranking would first return the documents with the highest relevance level, then
the next highest relevance level, and so on (we get Max DCG)
o  Normalization useful for contrasting queries with varying numbers of
relevant documents
o  NDCG popular in evaluating web search
70

NDCG = DCG / MaxDCG
NDCG Example
4 documents: d1, d2, d3, d4

rank i   Ideal system (IS)        System 1 (S1)            System 2 (S2)
         Document order    ri     Document order    ri     Document order    ri
1        d4                2      d3                2      d3                2
2        d3                2      d4                2      d2                1
3        d2                1      d2                1      d4                2
4        d1                0      d1                0      d1                0

NDCG_IS = 1.00        NDCG_S1 = 1.00        NDCG_S2 = 0.9203
71

DCG_IS = 2 + ( 2/log2(2) + 1/log2(3) + 0/log2(4) ) = 4.6309
DCG_S1 = 2 + ( 2/log2(2) + 1/log2(3) + 0/log2(4) ) = 4.6309
DCG_S2 = 2 + ( 1/log2(2) + 2/log2(3) + 0/log2(4) ) = 4.2619
MaxDCG = DCG_IS = 4.6309
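The NDCG example above can be checked with a few lines of Python (a minimal sketch; it uses the discount from the slides and assumes a non-empty list of graded judgments):

```python
import math

def dcg(relevances):
    """DCG with the discount used on the slides: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return relevances[0] + sum(rel / math.log2(i)
                               for i, rel in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    """Normalise by the DCG of the ideal (descending) ordering of the same judgments."""
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

print(dcg([2, 2, 1, 0]))   # ideal system and system 1: 4.6309
print(dcg([2, 1, 2, 0]))   # system 2: 4.2619
print(ndcg([2, 1, 2, 0]))  # 0.9203
```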
Problem with the test collection methodology
o  Building larger test collections along with complete relevance judgment is difficult
or impossible
- require assessor time, which is very expensive
- require many diverse retrieval “runs”
o  Recall is difficult if not impossible to get correctly as there is no way we can find all the
relevant documents for each query
o  Precision at top n often not stable enough
o  Issue:
- Non-judged documents are assumed non-relevant
- Can we reuse the test collection later on?
72
bpref measure
o  Binary preference-based measure
-  Introduced in 2004
-  Unlike MAP, P@10, and recall and precision, only uses information from judged documents
o  A function of how frequently relevant documents are retrieved before non-
relevant documents.
R is the number of judged relevant documents, r is a relevant retrieved
document, and n is a member of the first R judged irrelevant retrieved documents.
Non-judged documents are ignored.
73
bpref = (1/R) · Σ_r ( 1 - |n ranked higher than r| / R )
bpref measure
o  When comparing systems over test collections with complete judgments, MAP
and bpref are reported to be equivalent
o  With incomplete judgments, bpref is shown to be more stable
-  We look at what happens when we use fewer queries or more queries
-  We look at what happens when we swap documents in the ranking
74
bpref - Example
Retrieved result set with D2 and D5 being relevant:
D1
D2
D3 not judged
D4
--------
D5
D6
D7
D8
D9
D10 R=2
bpref = 1/2 · [ (1 - 1/2) + (1 - 2/2) ] = 0.25
75
bpref - Example
Retrieved result set with D2, D5 and D7 are relevant:
D1
D2
D3 not judged
D4 not judged
D5
D6
D7
D8
----------
D9
D10 R=3
bpref = 1/3 · [ (1 - 1/3) + (1 - 1/3) + (1 - 2/3) ] ≈ 0.56
76
bpref Example
Retrieved result set with D2, D4, D6 and D9 are relevant:
D1
D2
D3
D4
D6
D7
D8
----------
D9
D10 R=4
bpref = 1/4 · [ (1 - 1/4) + (1 - 2/4) + (1 - 2/4) + (1 - 4/4) ] ≈ 0.44
77
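A minimal sketch of bpref as defined on the earlier slide (my own code); unjudged documents are simply skipped, and the count of non-relevant documents above each relevant one is capped at R:

```python
def bpref(ranking, relevant, judged_nonrelevant):
    """bpref = (1/R) * sum over retrieved relevant r of (1 - min(#judged nonrel above r, R) / R)."""
    R = len(relevant)
    nonrel_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in relevant:
            total += 1 - min(nonrel_seen, R) / R
        elif doc in judged_nonrelevant:   # unjudged documents are ignored
            nonrel_seen += 1
    return total / R

# First example above: D2 and D5 relevant, D3 not judged, the rest judged non-relevant.
ranking = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]
print(bpref(ranking, {"D2", "D5"},
            {"D1", "D4", "D6", "D7", "D8", "D9", "D10"}))  # 0.25
```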
Evaluating interaction with the IR systems
o  Empirical data involving human users is time consuming to
gather and difficult to draw universal conclusions from
o  Evaluation metrics for user interaction (interface)
- Time required to learn the system
- Time to achieve goals on benchmark tasks
- Error rates
- Retention of the use of the interface over time
- User satisfaction
78
Why significance testing
o  System A beats System B on one query
-  Is it just a lucky query for System A?
-  Maybe System B does better on some other query?
-  Needs as many queries as possible
Empirical research suggests 25 is minimum needed
TREC tracks generally aim for at least 50 queries
o  Systems A and B identical on all but one query
-  If System A beats System B by enough on that one query, average will make A look better than B
As above could just be a lucky break for System A
-  Need A to beat B frequently to believe A is really better
o  System A is only 0.00001% better than system B
-  Even if true on all queries, does it mean much?
o  Significance testing considers these issues
79
Significance tests
o  Are observed differences statistically different?
-  Make use of statistics
o  Generally we cannot make assumptions about underlying distribution
-  Most significance tests do make such assumptions
o  Significance tests are easier to do on single-valued effectiveness measures (MAP, bpref)
o  Example: Sign test
-  Does not require that the data be normally distributed
-  For techniques A and B, compare average precision for each pair of results generated by queries in
the test collection
-  If difference is large enough, count as + or -, otherwise ignore
-  Use number of +’s and the number of significant differences to determine significance level
80
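A minimal sketch of the sign test described above, applied to per-query average precision scores of two systems; the scores and the threshold for ignoring small differences are hypothetical.

```python
from math import comb

def sign_test(scores_a, scores_b, min_diff=0.0):
    """Two-sided sign test on paired per-query scores (e.g. average precision).
    Differences smaller than min_diff are ignored, as described above."""
    plus = sum(1 for a, b in zip(scores_a, scores_b) if a - b > min_diff)
    minus = sum(1 for a, b in zip(scores_a, scores_b) if b - a > min_diff)
    n = plus + minus
    if n == 0:
        return 1.0
    k = max(plus, minus)
    # probability of k or more "wins" out of n under the null hypothesis p = 0.5
    p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

ap_a = [0.42, 0.35, 0.60, 0.51, 0.48, 0.55, 0.33, 0.47]   # hypothetical per-query AP, system A
ap_b = [0.38, 0.30, 0.52, 0.47, 0.41, 0.49, 0.31, 0.40]   # hypothetical per-query AP, system B
print(sign_test(ap_a, ap_b))  # ~0.008: A wins on all 8 queries, unlikely under chance alone
```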
Measures for large-scale systems … web search
o  Typical user behavior in web search shows a preference for high precision
o  Graded scales of relevance seem more useful than binary → NDCG
o  Recall is difficult to measure on the web
-  Often use precision at top k, such as k=5, k =10, …
o  . . . or measures that reward you more for getting rank 1 right than for getting
rank 10 right → NDCG
o  Use non-relevance-based datasets such as click-through data (query logs)
o  A/B testing
81
A/B testing
o  Test a single new "innovation"
o  Have most users use the old system
o  Divert a small proportion of traffic (e.g., 1%) to the new system that includes
the innovation
o  Evaluate with an "automatic" measure like click-through rate
o  Now we can directly see if the innovation does improve retrieval performance
(e.g. click-through rate)
o  Probably the evaluation methodology that large search engines trust most
Sec. 8.6.3
82
Bias in where users click
(Plot: number of clicks received at each rank position)
Strong position bias, so absolute click rates unreliable
83
Relative vs absolute ratings
	
	
(Figure: a result list with the user's click sequence marked)
o  Hard to conclude Result1 > Result3
o  Probably can conclude Result3 > Result2
o  Use pairwise relative ratings instead of individual ratings
o  Assess in terms of conformance with historical pairwise preferences
recorded from user clicks
84
Comparing two rankings via clicks and the interleaving method
Query: [support vector machines]          (Joachims, 2002)

System A                  System B
1. Kernel machines        1. Kernel machines
2. SVM-light              2. SVMs
3. Lucent SVM demo        3. Intro to SVMs
4. Royal Holl. SVM        4. Archives of SVM
5. SVM software           5. SVM-light
6. SVM tutorial           6. SVM software
85
Interleave the two rankings and remove duplicates
(Figure: the two rankings merged into a single interleaved list, with duplicated results shown only once)
86
Count user clicks
87
(Figure: the interleaved list with the user's clicks marked; each clicked result is credited to the ranking(s) it came from, here A and B, A, A)
Clicks: Ranking A: 3, Ranking B: 1
→ System A is better than System B
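A minimal sketch of the click-credit step (my own code and a simplified credit rule: a clicked result is credited to every original ranking that contains it, which is consistent with the counts on the slide; Joachims' interleaving method attributes clicks more carefully). The set of clicked results below is an assumption chosen to reproduce the slide's counts.

```python
def credit_clicks(clicked, ranking_a, ranking_b):
    """Credit each clicked result of the interleaved list to the original ranking(s)
    that contain it, and return the click counts for A and B."""
    clicks_a = sum(1 for doc in clicked if doc in ranking_a)
    clicks_b = sum(1 for doc in clicked if doc in ranking_b)
    return clicks_a, clicks_b

ranking_a = ["Kernel machines", "SVM-light", "Lucent SVM demo",
             "Royal Holl. SVM", "SVM software", "SVM tutorial"]
ranking_b = ["Kernel machines", "SVMs", "Intro to SVMs",
             "Archives of SVM", "SVM-light", "SVM software"]
clicked = ["Kernel machines", "Lucent SVM demo", "Royal Holl. SVM"]  # assumed click sequence

print(credit_clicks(clicked, ranking_a, ranking_b))  # (3, 1) -> evidence that A is better than B
```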
88
Evaluation of classifiers
o  Focus on measuring its effectiveness rather than efficiency
o  We recall that:
- Effectiveness is the ability to make the right classification decision
- Efficiency is concerned with time and space requirements
89
Evaluation of classifiers
o  After a classifier is constructed using a training set, the
effectiveness is evaluated using a test set
o  For each category ci, we calculate the following sets:
- TPi: true positives
-  FPi: false positives
-  TNi: true negatives
-  FNi: false negatives
90
True and false positives with respect to a
category
o  TPi: true positives with respect to category ci
- the set of documents that both the classifier and the previous
judgments (as recorded in the test set) classify under ci
o  FPi: false positives with respect to category ci
- the set of documents that the classifier classifies under ci, but the test
set indicates that they do not belong to ci
91
o  TNi: true negatives with respect to category ci
- both the classifier and the test set agree that the documents in
TNi do not belong to ci
o  FNi: false negatives with respect to category ci
- the classifier does not classify the documents in FNi under ci, but
the test set indicates that they should be classified under ci
True and false negatives with respect to a
category
92
Evaluation measures for classifiers
o  Precision with respect to category ci
o  Recall with respect to category ci
(Venn diagram: the set of documents classified under ci by the classifier (what it returns) and the test class ci (what it should return), with regions TPi, FPi, FNi and TNi)

Pi = TPi / (TPi + FPi)

Ri = TPi / (TPi + FNi)
93
Evaluation measures for classifiers
o  for obtaining estimates for precision and recall in the collection as
a whole, two different methods may be adopted:
- Micro-averaging
counts for true positives, false positives and false negatives for all categories are first
summed up
precision and recall are calculated using the global values
- Macro-averaging
average of precision (recall) for individual categories
94
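A minimal sketch contrasting the two averaging schemes (my own code; the per-category counts are hypothetical):

```python
def micro_macro(per_category):
    """per_category: list of (TP, FP, FN) tuples, one per category.
    Returns ((micro P, micro R), (macro P, macro R))."""
    # Micro-averaging: sum the counts over all categories first
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = (tp / (tp + fp), tp / (tp + fn))
    # Macro-averaging: average the per-category precision and recall values
    macro = (sum(c[0] / (c[0] + c[1]) for c in per_category) / len(per_category),
             sum(c[0] / (c[0] + c[2]) for c in per_category) / len(per_category))
    return micro, macro

# A large category classified well and a small category classified poorly
print(micro_macro([(90, 10, 10), (1, 4, 4)]))
# micro ~ (0.87, 0.87); macro ~ (0.55, 0.55): the small category pulls the macro-average down
```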
Micro- vs macro-averaging
o  microaveraging and macroaveraging may give quite
different results, if the different categories have very
different generality
o  e.g. the ability of a classifier to behave well also on
categories with low generality (i.e. categories with few
positive training instances) will be emphasized by
macroaveraging
o  choice depends on the application
Conclusions … a few words
o  Here we solely focused on system-oriented evaluation. We should not
forget about user-oriented evaluation
o  Here we focused on batch-style evaluation. We should not forget that
search is part of a bigger task.
o  At the end, it is all about making the users “happy”. We should not forget
about long-term engagement.
o  Lots of work and research looked beyond precision and recall, in terms of
validations, extensions or alternatives
o  Lots of work such as “significance testing” so that we can be sure that IR
system A is indeed better than IR system B.
o  Here we focused on “document” and text. We should not forget
multimedia, mobile, social media, etc, where evaluating effectiveness
may mean something a bit different.
95

More Related Content

What's hot

Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
silambu111
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
Nanthini Dominique
 

What's hot (20)

Token, Pattern and Lexeme
Token, Pattern and LexemeToken, Pattern and Lexeme
Token, Pattern and Lexeme
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
SLIQ
SLIQSLIQ
SLIQ
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Recognition-of-tokens
Recognition-of-tokensRecognition-of-tokens
Recognition-of-tokens
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Text compression
Text compressionText compression
Text compression
 
Chapter 3 image enhancement (spatial domain)
Chapter 3 image enhancement (spatial domain)Chapter 3 image enhancement (spatial domain)
Chapter 3 image enhancement (spatial domain)
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Gabor Filter
Gabor FilterGabor Filter
Gabor Filter
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Arithmetic coding
Arithmetic codingArithmetic coding
Arithmetic coding
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
Advanced data structures & algorithms important questions
Advanced data structures & algorithms important questionsAdvanced data structures & algorithms important questions
Advanced data structures & algorithms important questions
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwideweb
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 

Similar to An introduction to system-oriented evaluation in Information Retrieval

Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt do
PonnuthuraiSelvaraj1
 
Philosophy of IR Evaluation Ellen Voorhees
Philosophy of IR Evaluation Ellen VoorheesPhilosophy of IR Evaluation Ellen Voorhees
Philosophy of IR Evaluation Ellen Voorhees
k21jag
 
Search term recommendation and non-textual ranking evaluated
 Search term recommendation and non-textual ranking evaluated Search term recommendation and non-textual ranking evaluated
Search term recommendation and non-textual ranking evaluated
GESIS
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
Salford Systems
 

Similar to An introduction to system-oriented evaluation in Information Retrieval (20)

Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
 
information technology materrailas paper
information technology materrailas paperinformation technology materrailas paper
information technology materrailas paper
 
Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt do
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
Philosophy of IR Evaluation Ellen Voorhees
Philosophy of IR Evaluation Ellen VoorheesPhilosophy of IR Evaluation Ellen Voorhees
Philosophy of IR Evaluation Ellen Voorhees
 
information retrival evaluation.ppt
information retrival evaluation.pptinformation retrival evaluation.ppt
information retrival evaluation.ppt
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval
 
Tutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender SystemsTutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender Systems
 
Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
Search term recommendation and non-textual ranking evaluated
 Search term recommendation and non-textual ranking evaluated Search term recommendation and non-textual ranking evaluated
Search term recommendation and non-textual ranking evaluated
 
Metadata for Research Objects
Metadata for Research ObjectsMetadata for Research Objects
Metadata for Research Objects
 
PEDSnet : 18 month summary on data integration and data quality
PEDSnet : 18 month summary on data integration and data qualityPEDSnet : 18 month summary on data integration and data quality
PEDSnet : 18 month summary on data integration and data quality
 
Web analytics presentation
Web analytics presentationWeb analytics presentation
Web analytics presentation
 
Web analytics webinar
Web analytics webinarWeb analytics webinar
Web analytics webinar
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 
Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013
 
Data Analytics all units
Data Analytics all unitsData Analytics all units
Data Analytics all units
 
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
 
Lecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdfLecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdf
 

More from Mounia Lalmas-Roelleke

Tutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and OptimizationTutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and Optimization
Mounia Lalmas-Roelleke
 
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Mounia Lalmas-Roelleke
 
Story-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User EngagementStory-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User Engagement
Mounia Lalmas-Roelleke
 

More from Mounia Lalmas-Roelleke (20)

Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization
 
Tutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and OptimizationTutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and Optimization
 
Recommending and searching @ Spotify
Recommending and searching @ SpotifyRecommending and searching @ Spotify
Recommending and searching @ Spotify
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
 
Search @ Spotify
Search @ Spotify Search @ Spotify
Search @ Spotify
 
Tutorial on metrics of user engagement -- Applications to Search & E- commerce
Tutorial on metrics of user engagement -- Applications to Search & E- commerceTutorial on metrics of user engagement -- Applications to Search & E- commerce
Tutorial on metrics of user engagement -- Applications to Search & E- commerce
 
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
 
Social Media and AI: Don’t forget the users
Social Media and AI: Don’t forget the usersSocial Media and AI: Don’t forget the users
Social Media and AI: Don’t forget the users
 
Advertising Quality Science
Advertising Quality ScienceAdvertising Quality Science
Advertising Quality Science
 
Describing Patterns and Disruptions in Large Scale Mobile App Usage Data
Describing Patterns and Disruptions in Large Scale Mobile App Usage DataDescribing Patterns and Disruptions in Large Scale Mobile App Usage Data
Describing Patterns and Disruptions in Large Scale Mobile App Usage Data
 
Story-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User EngagementStory-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User Engagement
 
Mobile advertising: The preclick experience
Mobile advertising: The preclick experienceMobile advertising: The preclick experience
Mobile advertising: The preclick experience
 
Predicting Pre-click Quality for Native Advertisements
Predicting Pre-click Quality for Native AdvertisementsPredicting Pre-click Quality for Native Advertisements
Predicting Pre-click Quality for Native Advertisements
 
Improving Post-Click User Engagement on Native Ads via Survival Analysis
Improving Post-Click User Engagement on Native Ads via Survival AnalysisImproving Post-Click User Engagement on Native Ads via Survival Analysis
Improving Post-Click User Engagement on Native Ads via Survival Analysis
 
Evaluating the search experience: from Retrieval Effectiveness to User Engage...
Evaluating the search experience: from Retrieval Effectiveness to User Engage...Evaluating the search experience: from Retrieval Effectiveness to User Engage...
Evaluating the search experience: from Retrieval Effectiveness to User Engage...
 
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
A Journey into Evaluation: from Retrieval Effectiveness to User EngagementA Journey into Evaluation: from Retrieval Effectiveness to User Engagement
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
 
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini Users
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini UsersPromoting Positive Post-click Experience for In-Stream Yahoo Gemini Users
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini Users
 

Recently uploaded

Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
sexy call girls service in goa
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
imonikaupta
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Sheetaleventcompany
 

Recently uploaded (20)

Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Al Barsha Night Partner +0567686026 Call Girls Dubai
Al Barsha Night Partner +0567686026 Call Girls  DubaiAl Barsha Night Partner +0567686026 Call Girls  Dubai
Al Barsha Night Partner +0567686026 Call Girls Dubai
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 

An introduction to system-oriented evaluation in Information Retrieval

  • 1. An introduction to system-oriented evaluation in Information Retrieval Mounia Lalmas
  • 2. Outline o  What to evaluate in IR o  Test collection methodology -  Document, information need, query, relevance -  TREC o  Precision and recall -  Average precision, interpolated, mean average precision (MAP) -  P@r, R-Precision, MRR -  E and F measures o  Other measures (DCG, bpref) o  Significance testing o  Large-scale evaluation (web search & clicks) o  Evaluating classifiers 2 Information Retrieval = IR IR vs. Search
  • 3. Outline o  What to evaluate in IR o  Test collection methodology -  Document, information need, query, relevance -  TREC o  Precision and recall -  Average precision, interpolated, mean average precision (MAP) -  P@r, R-Precision, MRR -  E and F measures o  Other measures (DCG, bpref) o  Significance testing o  Large-scale evaluation (web search & clicks) o  Evaluating classifiers 3 Information Retrieval = IR IR vs. Search
  • 4. Evaluation in general versus evaluation in IR o  Evaluating a system in computer science is often concerned with time and space è system performance o  With large collections of documents, system performance is still very important o  However, in IR, we care a lot about retrieval performance: are the retrieved documents “relevant” to a “user information need”? 4
  • 5. Why do we need to evaluate an IR system? o  The user wants to find recipes about “couscous” as cooked in various countries o  User uses 2 IR systems o  How can we say which one is better? 5
  • 6. Acknowledgements 6 These slides were based on - Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas - Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan - Retrieval Evaluation @ University of Virginia; Hongning Wang - Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson - Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles o  Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979) o  Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008) o  Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edition; R. Baeza-Yates & B. Ribeiro-Neto (2011)
  • 7. What to evaluate in IR o  coverage of the collection: extent to which the system includes relevant material o  time lag (efficiency): average interval between the time a query is submitted and the answer is given o  presentation of the output o  effort involved by user in obtaining answers to a query o  recall of the system: proportion of relevant documents retrieved o  precision of the system: proportion of the retrieved documents that are actually relevant 7
  • 8. o  coverage has to do with the quality of the collection o  efficiency in terms of speed, memory usage, etc o  presentation has to do with interface and visualisation issues o  effort has to do with user issues, e.g. user satisfaction. o  recall and precision have to do with retrieval effectiveness or effectiveness for short è system-oriented evaluation 8 What to evaluate in IR
  • 9. System-oriented evaluation o  Measuring effectiveness has been the most predominant in IR evaluation o  Test collection methodology - Benchmark (dataset) upon which effectiveness is measured and compared - Dataset tells for a given query what are the relevant documents o  Metrics to measure effectiveness - Precision and recall, and variants - E and F measures - Others (DCG, bpref) 9
  • 10. Test Collection methodology o  Compare retrieval performance using a test collection - Document collection, that is, the documents themselves. The document collection depends on the task, e.g. evaluating web retrieval requires a collection of HTML documents. - Queries, which simulate real user information needs. - Relevance judgements, stating for a query the relevant documents. o  To compare the performance of two techniques: - each technique used to answer queries - results (set or ranked list) compared using some effectiveness performance measure - most common measures are precision and recall o  Usually use multiple measures to get different views of performance o  Usually test with multiple collections as performance can be collection dependent 10
  • 11. Information need, query and relevance o  The information need is translated into a query o  Relevance is assessed relative to the information need, not the query - Information need: I am looking for information on what are the best places to go on holiday near the beach and play tennis - Query: tennis beach holiday - Evaluate whether the document addresses the information need, not whether it has the three words “tennis”, “beach” and “holiday” Sec. 8.1 11
  • 12. Relevance … as defined in system-oriented evaluation o  A document is relevant if it “has significant and demonstrable bearing on the matter at hand”. o  There are common assumptions about the nature of relevance in system-centred evaluation: - Objectivity: everybody agrees on whether a document is relevant to a query or not - Topicality: relevance is about whether the document is about the topic expressed in the query - Binary nature: either a document is relevant or not - Independence: the fact that a document is relevant to a query has no effect on the relevance of another document for that same query 12
  • 13. Relevance is difficult to define satisfactorily o  A document is relevant within the context of a query - Who judges the relevance? è humans not very consistent (see next slide) - Is the document useful? è Utility - Judgment on whether a document is relevant or not depends on more than the document and the query o  With real collections, we never know the full set of relevant documents o  Retrieval model incorporates notion of relevance - Satisfiability of a logical expression in Boolean model - P(relevance | query, document) in BIRM - Similarity to query in VSM - P(query generated | document model) in LM 13
  • 14. Kappa measure for inter-judge relevance agreement o  Kappa measure - Agreement measure among judges (assessing document relevance) - Designed for categorical judgments (relevant or not) o  Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ] o  P(A) – proportion of time judges agree o  P(E) – what agreement would be by chance o  Kappa = 0 for chance agreement, 1 for total agreement Sec. 8.5 14
  • 15. Kappa Measure: Example Sec. 8.5 15
    Number of documents assessed, by Judge 1 / Judge 2:
    - 300: Relevant / Relevant (judges agree)
    - 70: Non-relevant / Non-relevant (judges agree)
    - 20: Relevant / Non-relevant (judges disagree)
    - 10: Non-relevant / Relevant (judges disagree)
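As a concrete illustration (not part of the original slides), a minimal Python sketch of this kappa computation from the counts above; the function name cohen_kappa and its argument order are my own.

# Minimal sketch (mine, not from the slides): Cohen's kappa for the two
# judges above; cohen_kappa and its argument names are illustrative.

def cohen_kappa(both_rel, both_nonrel, rel_nonrel, nonrel_rel):
    """Kappa = (P(A) - P(E)) / (1 - P(E)) for binary relevance judgments."""
    total = both_rel + both_nonrel + rel_nonrel + nonrel_rel
    p_agree = (both_rel + both_nonrel) / total                # P(A)
    # marginal probability that each judge says "relevant"
    j1_rel = (both_rel + rel_nonrel) / total
    j2_rel = (both_rel + nonrel_rel) / total
    p_chance = j1_rel * j2_rel + (1 - j1_rel) * (1 - j2_rel)  # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(round(cohen_kappa(300, 70, 20, 10), 3))   # ~0.776: high, but not total, agreement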
  • 17. Impact of inter-judge agreement on IR systems comparisons o  Impact on absolute effectiveness performance measure can be significant (0.32 vs 0.39) o  But little impact on ranking of different systems or relative effectiveness performance o  If we just want to know if IR system A is better than IR system B è test collection methodology gives reliable comparison Sec. 8.5 17
  • 18. Find the relevant documents in the collection o  Did the IR system find all relevant documents? o  To answer accurately, we need complete judgments - i.e., “yes,” “no,” or some score for every query-document pair o  For small test collections, we can review all documents for all queries o  Not practical for large or even medium-sized collections - TREC collections have millions of documents o  Pooling method o  Click-based evaluation in web search (later in the lecture) 18
  • 19. Test collection creation o  Manual method: - Every document in the collection is judged against every query by one of several judges (human assessors) - This is feasible for small document collections. o  Pooling method (used for large document collections): - The queries are run against several IR systems first - The top, for example 100, documents retrieved by each system are pooled together - The pool is then judged for relevance (by human assessors) - This is what TREC does o  Query logs (web search) è see later about “evaluation with clicks” 19
  • 20. Sample test collections (ad hoc retrieval) 20 ad hoc retrieval: query, document, ranking
    Characteristics (Cranfield / CACM / ISI / West / TREC2):
    - Collection size (docs): 1400 / 3204 / 1460 / 11953 / 742611
    - Collection size (MB): 1.5 / 2.3 / 2.2 / 254 / 2162
    - Year created: 1968 / 1983 / 1983 / 1990 / 1991
    - Unique stems: 8226 / 5493 / 5448 / 196707 / 1040415
    - Stem occurrences: 123200 / 117578 / 98304 / 21798833 / 243800000
    - Max within-document frequency: 27 / 27 / 1309 (remaining values not available)
    - Mean document length (words): 88 / 36.5 / 67.3 / 1823 / 328
    - Number of queries: 225 / 50 / 35 / 44 / 100
  • 21. CIS o  1239 documents about cystic fibrosis from MEDLINE collection o  Fields: author, title, source, major and minor subjects, abstracts, references and citations o  100 queries, developed by relevance judges o  Unusual features: -  4 judges per document per query (3 experts, 1 medical bibliographer) -  3 levels of relevance (0-2) -  Combined relevance on scale of 0-8 (table of example judgment combinations and their combined scores) 21 Added so we do not forget history
  • 22. CACM o  3204 articles on computer science from CACM, 1958 - 1979 o  Fields: author, date, word stems for titles and abstracts, categories, direct referencing, bibliographic coupling, number of co-citations for each pair of articles o  52 queries, each with 2 Boolean formulations o  Unusual features: - Citation links to other documents, so often used for hypertext-type experiments 22 Added so we do not forget history
  • 23. TREC o  Text REtrieval Conference / Competition - http://trec.nist.gov - Run by NIST (National Institute of Standards & Technology) o  Collections: > terabytes o  Datasets - Newswire & full-text news (AP, WSJ, Ziff, FT) - Government documents (Federal Register, Congressional Record) - Radio transcripts (FBIS) - Web “subsets” - … 23
  • 25. Queries & relevance judgments at TREC o  Queries devised and judged by “information specialists” èTREC Topics o  Relevance judgments done only for those documents retrieved and not entire collection! - E.g. merge top 100 retrieved documents from systems experimented with (TREC participants) - Pooling method 25
  • 26. Example (excerpt) of a TREC document 26
    <doc>
    <docno> WSJ880406-0090 </docno>
    <hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
    <author> Janet Guyon (WSJ Staff) </author>
    <dateline> New York </dateline>
    <text> American Telephone & Telegraph Co. introduced the rest of a new generation of phone services with broad ... </text>
    </doc>
  • 27. Example (excerpt) of a TREC topic 27
    <top>
    <num> Number: 168
    <title> Topic: Financing AMTRAK
    <desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)
    <nar> Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatisation of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
    </top>
  • 28. TREC legacy o  Pros: - made research systems scale to large collections (pre-WWW) - allows for controlled comparisons o  Cons: - emphasis on high recall, often unrealistic for what most users want è but recall-oriented search tasks exist (patent retrieval, e-discovery) - very long queries, unrealistic è systems optimized for long queries and hence perform worse for shorter, more realistic queries - focus on batch ranking (one-off result) rather than interaction (but the session track was introduced to evaluate a “user search session”) 28
  • 29. Other evaluation forums o  CLEF (Cross-Language Evaluation Forum) o  NTCIR (NII Testbeds and Community for Information access Research) o  FIRE (Forum for Information Retrieval Evaluation) o  INEX (The Initiative for the Evaluation of XML retrieval) 29
  • 30. Effectiveness o  We recall that the goal of an IR system is to retrieve as many relevant documents as possible and as few non-relevant documents as possible. o  Evaluating the above consists of a comparative evaluation of technical performance of IR system(s): - In traditional IR, technical performance means the effectiveness of the IR system: the ability of the IR system to retrieve relevant documents and suppress non-relevant documents - Effectiveness is measured by the combination of recall and precision 30
  • 31. Intuition behind precision and recall o  Collection of 10,000 documents, 50 relevant to a given topic o  Ideal system finds these 50 documents and rejects all others o  An actual system likely identifies 25 documents; 20 are relevant and 5 are on other topics Precision: 20/25 = 0.8 (80% of retrieved documents are relevant) Recall: 20/50 = 0.4 (40% of the relevant documents are found) 31
  • 32. Measuring Precision and Recall Precision is easy to measure: o  Look at each document retrieved and decide whether it is relevant or not o  In previous example, only the 25 documents that are found need to be examined Recall is difficult to measure: o  To know all relevant items, we must go through entire collection, looking at every document to decide if it is relevant or not o  In previous example, all 10,000 documents must be examined! è remember the pooling method at TREC 32
  • 33. Recall / Precision (figure: Venn diagram of the document collection, the retrieved set, the relevant set, and their intersection “retrieved and relevant”) Knowing which documents are relevant to which queries comes from the test collection For a given query, the document collection can be divided into three sets: the set of retrieved documents, the set of relevant documents, and the rest of the documents. 33
  • 34. Recall / Precision (figure: Venn diagram as on the previous slide) In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall. 34 precision = (number of relevant documents retrieved) / (number of documents retrieved) recall = (number of relevant documents retrieved) / (number of relevant documents)
  • 35. Retrieved vs. Relevant Documents (figure: retrieved vs. relevant sets) Very high precision, very low recall 35 A high precision rate is achieved by returning only documents that we know for sure are relevant à Is this a good idea?
  • 36. Retrieved vs. Relevant Documents (figure: retrieved vs. relevant sets) High recall, but low precision 36 100% recall can be achieved by returning all documents in the collection à This is for sure a bad idea!
  • 37. Retrieved vs. Relevant Documents (figure: retrieved vs. relevant sets) Very low precision, very low recall (0 for both) 37 Total failure!
  • 38. Retrieved vs. Relevant Documents (figure: retrieved vs. relevant sets) High precision, high recall 38 The perfect scenario!
  • 39. Recall and Precision The above two measures do not take into account where the relevant documents are retrieved, that is, at which rank (crucial since the output of most IR systems is a ranked list of documents). This is very important because an effective IR system should not only retrieve as many relevant documents as possible and as few non-relevant documents as possible, but it should also retrieve relevant documents before the non-relevant ones. 39 precision = (number of relevant documents retrieved) / (number of documents retrieved) recall = (number of relevant documents retrieved) / (number of relevant documents)
  • 40. Recall and Precision o  Let us assume that for a given query, the following documents are relevant (10 relevant documents) {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} o  Now suppose that the following documents are retrieved for that query. For each relevant retrieved document (d123, d56, d9, d25 and d3), we calculate the precision value and the recall value. For example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the precision is 2/3. We have so far retrieved 2 of the relevant documents (the total number of relevant documents being 10), so recall is 2/10. 40
    rank: doc (precision, recall at that rank, shown only for the relevant documents)
    1: d123 (1/1, 1/10)
    2: d84
    3: d56 (2/3, 2/10)
    4: d6
    5: d8
    6: d9 (3/6, 3/10)
    7: d511
    8: d129
    9: d187
    10: d25 (4/10, 4/10)
    11: d48
    12: d250
    13: d113
    14: d3 (5/14, 5/10)
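A small Python sketch (mine, not from the slides) that reproduces the recall/precision pairs of this worked example; the variable names relevant and ranking are illustrative.

# Sketch: recall/precision pairs for the worked example above.

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511",
           "d129", "d187", "d25", "d48", "d250", "d113", "d3"]

def recall_precision_points(ranking, relevant):
    """(recall, precision) at each rank where a relevant document is retrieved."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

for recall, precision in recall_precision_points(ranking, relevant):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
# (0.10, 1.00) (0.20, 0.67) (0.30, 0.50) (0.40, 0.40) (0.50, 0.36)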
  • 41. Recall and Precision o  For each query, we obtain pairs of recall and precision values - In our example, we would obtain (1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) … which are usually expressed in % (10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%), (50%, 35.71%) … - This can be read for instance: at 20% recall, we have 66.66% precision; at 50% recall, we have 35.71% precision The pairs of values are plotted into a graph (figure: recall-precision curve, recall in % on the x-axis, precision in % on the y-axis) 41
  • 42. Recall and Precision o  We have shown how to derive the recall and precision curve for a given query o  Now we describe how, using the above for all queries, the effectiveness of an IR system is evaluated and thus compared to other IR systems. o  Note that we can also compare the same system, but with different versions (e.g. different parameters are used). The idea here is to find out the best version of the IR system. 42
  • 43. The complete methodology For each IR system / IR system version 1.  For each query in the test collection a.  We first run the query against the system to obtain a ranked list of retrieved documents b.  We use the ranking and relevance judgements to calculate recall/precision pairs 2.  Then we average recall / precision values across all queries, to obtain an overall measure of the effectiveness 43
  • 44. Averaging across queries o  Hard to compare precision and recall graphs or tables for individual queries (too much data) - Need to average over many queries o  Two main types of averaging - Macro-average: each query is a point in the average - Micro-average: each relevant document is a point in the average - Macro is mostly used (all queries count equally) 44
  • 45. (Macro) Interpolated average precision o  Average precision at standard recall points o  For a given query, compute precision and recall point for every relevant document o  Interpolate precision at standard recall levels - 11-pt is usually 100%, 90%, 80%, ..., 10%, 0% o  Average over all queries to get average precision at each recall level 45
  • 46. Interpolation (figure: recall-precision graph with an observed value at 25% recall and the interpolated value at the 30% standard recall level) It is often the case that precision values are not given for standard recall values (10%, 20%, …). We therefore need to interpolate to obtain values at the standard recall levels. For example, the observed recall value is 25%, and it is interpolated to the nearest standard recall value on the right, that is 30%. 46
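A hedged sketch of interpolation onto the 11 standard recall levels. It uses the common trec_eval-style rule (interpolated precision at level ℓ = maximum precision observed at any recall ≥ ℓ), which is one of several conventions; the input points are the pairs from the earlier worked example, and the function name is my own.

# Sketch: interpolate observed (recall, precision) points onto the 11
# standard recall levels, using the "max precision at recall >= level" rule
# (the trec_eval convention; other conventions exist).

def interpolate_11pt(points):
    """points: (recall, precision) pairs; returns precision at 0.0, 0.1, ..., 1.0."""
    interpolated = []
    for level in [i / 10 for i in range(11)]:
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

# the pairs from the worked example a few slides back
points = [(0.1, 1.0), (0.2, 0.67), (0.3, 0.5), (0.4, 0.4), (0.5, 0.36)]
print(interpolate_11pt(points))
# [1.0, 1.0, 0.67, 0.5, 0.4, 0.36, 0.0, 0.0, 0.0, 0.0, 0.0]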
  • 47. Interpolated average precision (figure: precision-recall curves at standard recall levels for query 1, query 2, and their average) We have precision values at standard recall values for two queries. The precision values for query 1 are higher than those for query 2. This means that the effectiveness of the IR system is better for query 1 than for query 2. We can plot the average of the two queries. 47
  • 48. Averaging The same information can be displayed in a table. 48
    Recall (%): Query 1 precision (%) / Query 2 precision (%) / Average (%)
    10: 80 / 60 / 70
    20: 80 / 50 / 65
    30: 60 / 40 / 50
    40: 60 / 30 / 45
    50: 40 / 25 / 32.5
    60: 40 / 20 / 30
    70: 30 / 15 / 22.5
    80: 30 / 10 / 20
    90: 20 / 5 / 12.5
    100: 20 / 5 / 12.5
  • 49. Comparison of systems (figure: precision-recall curves for system 1 and system 2) We can now compare IR systems / system versions. For example, here we see that at low recall, system 2 is better than system 1, but this changes from recall value 30%, etc. It is common to calculate an average precision value across all recall levels, so as to have a single value to compare. 49
  • 50. Averaging across averages o  Average interpolated recall levels to get single result - Called “interpolated average precision” -  Not used much anymore; “mean average precision” more common -  Values at specific interpolated points still commonly used o  Mean average precision (MAP) - (“Average average precision” sounds weird) - Average precision over all relevant documents, non-interpolated - Reward systems that retrieve relevant documents quickly (highly ranked) 50
  • 51. Mean Average Precision Consider the rank positions of the n relevant documents for a given query: r1, r2, … rn Compute precision@r (denoted P@r) for each r1, r2, … rn Average precision = average of P@r for the given query, e.g. (1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76 MAP is Average Precision across multiple queries 51
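A minimal sketch of non-interpolated average precision; the example ranking is one ranking consistent with the slide (three relevant documents in total, retrieved at ranks 1, 3 and 5), and the function names are illustrative.

# Sketch: non-interpolated average precision (AP) and MAP.
# The ranking below is one ranking consistent with the slide's example:
# three relevant documents in total, retrieved at ranks 1, 3 and 5.

def average_precision(relevance_flags, num_relevant):
    """relevance_flags: True/False for the ranked results, top first."""
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(relevance_flags, start=1):
        if is_rel:
            hits += 1
            total += hits / rank             # P@r at each relevant rank
    return total / num_relevant              # unretrieved relevant docs contribute 0

def mean_average_precision(per_query_aps):
    return sum(per_query_aps) / len(per_query_aps)

ap = average_precision([True, False, True, False, True], num_relevant=3)
print(round(ap, 2))                           # 0.76, as on the slide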
  • 52. Mean Average Precision (MAP) 52 average precision query 1 (AP) = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62 average precision query 2 (AP) = (0.5 + 0.4 + 0.43)/3 = 0.44 mean average precision (MAP) = (0.62 + 0.44)/2 = 0.53
  • 53. More about mean average precision (MAP) o  If a relevant document is not retrieved, precision corresponding to that relevant document is zero o  Most commonly used measure in research papers … with issues o  Not so good for web search evaluation (precision oriented) - MAP assumes user is interested in finding many relevant documents 53
  • 54. TREC (trec_eval) evaluation results 54
    Average precision at standard recall levels: recall 0.0: 0.61, recall 0.1: 0.45, …, recall 1.0: 0.003
    Average precision over all relevant documents, non-interpolated (MAP): 0.23
  • 55. Average precision per query (figure: difference in average precision per topic, topic ids 200, 201, 202, 203, 204, …, y-axis from -1.0 to 1.0) 55 A system may perform badly for some information needs (MAP = 0.1) and excellently on others (MAP = 0.7) è often the case that variance in performance of the same system across queries is much greater than the variance of different systems on the same query There are easy information needs and hard ones!
  • 56. Rank-based measures o Binary relevance - Mean Average Precision (MAP) - P@r - R-Precision - Mean Reciprocal Rank (MRR) - bpref o Multiple levels of relevance - Normalized Discounted Cumulative Gain (NDCG) 56
  • 57. P@r or Precision @ rank r Set a rank threshold r Compute % relevant documents in top r Ignores documents ranked lower than r P@3 = 2/3 P@4 = 2/4 P@5 = 3/5 actual performance as a user might see it often used in web retrieval used at fixed rank values: P@5, P@10 57 Note the slight difference with P@r in slide 51
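A tiny sketch of P@r; the 0/1 relevance pattern below is one pattern consistent with the values on the slide (P@3 = 2/3, P@4 = 2/4, P@5 = 3/5), not the slide's actual ranking.

# Sketch: precision at rank r. The 0/1 pattern below is one pattern
# consistent with the slide's values, not the slide's actual ranking.

def precision_at(r, relevance_flags):
    """relevance_flags: 0/1 judgments of the ranked results, top first."""
    return sum(relevance_flags[:r]) / r

flags = [1, 1, 0, 0, 1]
for r in (3, 4, 5):
    print(f"P@{r} = {precision_at(r, flags):.2f}")   # 0.67, 0.50, 0.60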
  • 58. R-Precision o  Precision after R documents are retrieved o  R = number of relevant documents for the topic o  De-emphasizes exact ranking of retrieved relevant documents, which can be useful for topics with a large number of relevant documents o  A perfect system could score 1.0 o  Average R-precision - Example: 2 topics, with 50 and 10 relevant documents respectively. - Assume the IR system returns 17 relevant documents in the top 50 documents for the 1st topic and 7 relevant documents in the top 10 for the 2nd topic - Average R-precision for this IR system is (17/50 + 7/10) / 2 = 0.52 58
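A sketch reproducing the average R-precision example above; the function and variable names are illustrative.

# Sketch: R-precision and its average over topics, reproducing the example
# above (17 relevant in the top 50 for R=50; 7 in the top 10 for R=10).

def r_precision(relevant_in_top_R, R):
    return relevant_in_top_R / R

topics = [(17, 50), (7, 10)]
scores = [r_precision(hits, R) for hits, R in topics]
print(sum(scores) / len(scores))   # (0.34 + 0.70) / 2 = 0.52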
  • 59. Mean Reciprocal Rank (MRR) o Suppose there is only one relevant document o Scenarios: known-item search, navigational queries, looking for a fact o Search duration à rank of the answer; measures the user effort in finding that one and only document Consider the rank position, r, of the first relevant document Reciprocal Rank score = 1/r MRR is the mean RR across multiple queries 59
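A sketch of reciprocal rank and MRR; the two example queries and document identifiers are made up for illustration.

# Sketch: reciprocal rank and MRR; the two example queries are made up.

def reciprocal_rank(ranking, relevant):
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0                                 # no relevant document retrieved

def mean_reciprocal_rank(runs):
    """runs: (ranking, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [(["d7", "d2", "d9"], {"d2"}),          # first relevant at rank 2 -> 0.5
        (["d1", "d4", "d3"], {"d1"})]          # first relevant at rank 1 -> 1.0
print(mean_reciprocal_rank(runs))              # 0.75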
  • 60. E-measure o  Used to emphasize precision (or recall) - Essentially a weighted average of precision and recall - Large α increases importance of precision o  Can be transformed by α = 1/(β² + 1) leading to - When β = 1 (α = 1/2) equal importance of precision and recall - Normalised symmetric difference of retrieved and relevant sets 60 E = 1 - 1 / (α (1/P) + (1 - α) (1/R)) E = 1 - (β² + 1) P R / (β² P + R)
  • 61. Symmetric Difference and E A is the retrieved set of documents B is the set of relevant documents P = |A∩B|/|A| R = |A∩B|/|B| A⊗B (the symmetric difference) is the shaded area in the Venn diagram |A⊗B| = |A∪B| - |A∩B| = |A| + |B| - 2|A∩B| Eβ=1 = 1 - 2PR/(P+R) = (P + R - 2PR)/(P + R) = … = |A⊗B| / (|A| + |B|) the symmetric difference, normalised 61
  • 62. F measure o  F = 1 - E often used - Good results mean larger values of F - “F1” measure is popular: F with β = 1 - particularly popular for evaluating classification approaches harmonic mean of P and R 62 F = 1 - E = (β² + 1)PR / (β²P + R) F1 = 2PR / (P + R) = 1 / ( (1/2) (1/R + 1/P) ) Harmonic mean is a conservative average
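A sketch of the F (and E) computation from precision and recall, following the formulas above; the example precision and recall values are illustrative.

# Sketch: F_beta and E from precision and recall, following the formulas above.

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    return 1 - f_measure(precision, recall, beta)

print(round(f_measure(0.8, 0.4), 3))             # F1 (harmonic mean) = 0.533
print(round(f_measure(0.8, 0.4, beta=2.0), 3))   # beta > 1 gives recall more weight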
  • 63. F measure, geometric interpretation (Venn diagram of A, B and A∩B) A is the retrieved set of documents B is the set of relevant documents P = |A∩B|/|A| R = |A∩B|/|B| 63 Fβ=1 = 2PR/(P + R) = (2|A∩B|² / (|A| |B|)) / ( |A∩B| (1/|A| + 1/|B|) ) = 2|A∩B| / (|A| + |B|)
  • 64. Relation to Contingency Table Why is accuracy not much used in IR with large document collections? - Most documents are NOT relevant - Most documents are NOT retrieved - This inflates the accuracy value 64
    Contingency table: a = retrieved and relevant, b = retrieved and not relevant, c = not retrieved and relevant, d = not retrieved and not relevant
    Accuracy = (a + d)/(a + b + c + d), Precision = a/(a + b), Recall = a/(a + c)
  • 65. Are all relevant documents “equally” relevant? (figure: example results labelled Excellent, Good, fair) 65
  • 66. Discounted Cumulative Gain (DCG) o  Popular measure for evaluating web search o  Two assumptions: - Highly relevant documents are more useful than marginally relevant documents - The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined 66
  • 67. Discounted Cumulative Gain (DCG) o  Uses graded relevance as a measure of usefulness, or gain, from examining a document o  Gain is accumulated starting at the top of the ranking and can be reduced, or discounted, at lower ranks o  Typical discount is 1/log(rank) - With base 2, the discount at rank 4 is 1/2 , and at rank 8 it is 1/3 67
  • 68. Summarize a Ranking with DCG o  Relevance judgments in a scale of [0, r] with r > 2 o  Cumulative Gain (CG) at rank n - Let the ratings of the n documents be r1, r2, … rn (in ranked order) - CG = r1 + r2 + … + rn o  Discounted Cumulative Gain (DCG) at rank n - DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n) (We may use any base for the logarithm) 68 DCGn = rel1 + Σi=2..n reli / log2(i)
  • 69. DCG Example o  10 ranked documents judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0 o  discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0 o  discounted cumulative gain (DCG): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61 69 (figure: DCG as a function of rank)
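A sketch that reproduces the DCG-by-rank numbers above, using the DCG definition from the previous slide (rank 1 undiscounted, 1/log2(i) from rank 2 onward); the function name is my own.

# Sketch: DCG at each rank for the graded judgments above,
# using DCG_n = rel_1 + sum_{i=2..n} rel_i / log2(i).

import math

def dcg_by_rank(gains):
    dcg, out = 0.0, []
    for i, gain in enumerate(gains, start=1):
        dcg += gain if i == 1 else gain / math.log2(i)
        out.append(round(dcg, 2))
    return out

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(dcg_by_rank(gains))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]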
  • 70. Summarize a Ranking with NDCG o  Normalized Discounted Cumulative Gain (NDCG) at rank n - Normalize DCG at rank n by the DCG value at rank n of the ideal ranking - Ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, and so on (we get Max DCG) o  Normalization useful for contrasting queries with varying numbers of relevant documents o  NDCG popular in evaluating web search 70 NDCG = DCG / MaxDCG
  • 71. NDCG Example 4 documents: d1, d2, d3, d4 71
    Rank i: Ideal system (IS) / System 1 (S1) / System 2 (S2), each as document (gain ri)
    1: d4 (2) / d3 (2) / d3 (2)
    2: d3 (2) / d4 (2) / d2 (1)
    3: d2 (1) / d2 (1) / d4 (2)
    4: d1 (0) / d1 (0) / d1 (0)
    DCG_IS = 2 + (2/log2(2) + 1/log2(3) + 0/log2(4)) = 4.6309
    DCG_S1 = 2 + (2/log2(2) + 1/log2(3) + 0/log2(4)) = 4.6309
    DCG_S2 = 2 + (1/log2(2) + 2/log2(3) + 0/log2(4)) = 4.2619
    MaxDCG = DCG_IS = 4.6309
    NDCG_IS = 1.00, NDCG_S1 = 1.00, NDCG_S2 = 0.9203
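A sketch reproducing this NDCG example, with the ideal ranking obtained by sorting the gains in decreasing order; the list names system_1 and system_2 are illustrative.

# Sketch: NDCG for the 4-document example, with the ideal ranking obtained
# by sorting the gains in decreasing order.

import math

def dcg(gains):
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = sorted(gains, reverse=True)        # best gains first = MaxDCG ranking
    return dcg(gains) / dcg(ideal)

system_1 = [2, 2, 1, 0]    # d3, d4, d2, d1
system_2 = [2, 1, 2, 0]    # d3, d2, d4, d1
print(round(ndcg(system_1), 4))    # 1.0
print(round(ndcg(system_2), 4))    # 0.9203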
  • 72. Problem with the test collection methodology o  Building larger test collections along with complete relevance judgment is difficult or impossible - require assessor time, which is very expensive - require many diverse retrieval “runs” o  Recall is difficult if not impossible to get correctly as there is no way we can find all the relevant documents for each query o  Precision at top n often not stable enough o  Issue: - Non-judged documents are assumed non-relevant - Can we reuse the test collection later on? 72
  • 73. bpref measure o  Binary preference-based measure -  Introduced in 2004 -  Unlike MAP, P@10, and recall and precision, only uses information from judged documents o  A function of how frequently relevant documents are retrieved before non-relevant documents. R is the number of judged relevant documents, r is a relevant retrieved document, and n is a member of the first R judged non-relevant retrieved documents. Non-judged documents are ignored. 73 bpref = (1/R) Σr (1 - |n ranked higher than r| / R)
  • 74. bpref measure o  When comparing systems over test collections with complete judgments, MAP and bpref are reported to be equivalent o  With incomplete judgments, bpref is shown to be more stable -  We look at what happens when we use fewer queries or more queries -  We look at what happens when we swap documents in the ranking 74
  • 75. bpref - Example Retrieved result set, with D2 and D5 relevant and D3 not judged: D1, D2, D3 (not judged), D4, D5, D6, D7, D8, D9, D10 R = 2; the first R judged non-relevant documents are D1 and D4 bpref = 1/2 [(1 - 1/2) + (1 - 2/2)] = 0.25 75
  • 76. bpref - Example Retrieved result set, with D2, D5 and D7 relevant and D3, D4 not judged: D1, D2, D3 (not judged), D4 (not judged), D5, D6, D7, D8, D9, D10 R = 3; the first R judged non-relevant documents are D1, D6 and D8 bpref = 1/3 [(1 - 1/3) + (1 - 1/3) + (1 - 2/3)] ≈ 0.56 76
  • 77. bpref Example Retrieved result set, with D2, D4, D6 and D9 relevant: D1, D2, D3, D4, D6, D7, D8, D9, D10 R = 4; the first R judged non-relevant documents are D1, D3, D7 and D8 bpref = 1/4 [(1 - 1/4) + (1 - 2/4) + (1 - 2/4) + (1 - 4/4)] ≈ 0.44 77
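A sketch of bpref that reproduces the last example; the judged dictionary marks judged documents as relevant (True) or non-relevant (False), and unjudged documents are simply absent. In general R would count all judged relevant documents for the query, not only the retrieved ones.

# Sketch: bpref from a ranked result list and (possibly incomplete) judgments.
# judged[doc] is True (relevant) or False (non-relevant); unjudged documents
# are simply absent and therefore ignored. Here R is taken from the judgments
# of this example; in general R counts all judged relevant documents.

def bpref(ranking, judged):
    R = sum(judged.values())                       # judged relevant documents
    score, nonrel_seen = 0.0, 0
    for doc in ranking:
        if doc not in judged:                      # unjudged: ignored
            continue
        if judged[doc]:                            # a relevant retrieved document r
            score += 1 - min(nonrel_seen, R) / R   # only the first R non-relevant count
        else:
            nonrel_seen += 1
    return score / R if R else 0.0

# the last example: relevant D2, D4, D6, D9
ranking = ["D1", "D2", "D3", "D4", "D6", "D7", "D8", "D9", "D10"]
judged = {d: d in {"D2", "D4", "D6", "D9"} for d in ranking}
print(bpref(ranking, judged))   # 1/4 [(1-1/4) + (1-2/4) + (1-2/4) + 0] = 0.4375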
  • 78. Evaluating interaction with the IR systems o  Empirical data involving human users is time consuming to gather and difficult to draw universal conclusions from o  Evaluation metrics for user interaction (interface) - Time required to learn the system - Time to achieve goals on benchmark tasks - Error rates - Retention of the use of the interface over time - User satisfaction 78
  • 79. Why significance testing o  System A beats System B on one query -  Is it just a lucky query for System A? -  Maybe System B does better on some other query? -  Need as many queries as possible; empirical research suggests 25 is the minimum needed, and TREC tracks generally aim for at least 50 queries o  Systems A and B identical on all but one query -  If System A beats System B by enough on that one query, the average will make A look better than B, which could again just be a lucky break for System A -  Need A to beat B frequently to believe A is really better o  System A is only 0.00001% better than system B -  Even if true on all queries, does it mean much? o  Significance testing considers these issues 79
  • 80. Significance tests o  Are observed differences statistically different? -  Make use of statistics o  Generally we cannot make assumptions about the underlying distribution -  Most significance tests do make such assumptions o  Significance tests are easier to do on single-valued effectiveness measures (MAP, bpref) o  Example: Sign test -  Does not require that data be normally distributed -  For techniques A and B, compare average precision for each pair of results generated by queries in the test collection -  If the difference is large enough, count it as + or -, otherwise ignore it -  Use the number of +'s and the number of -'s to determine the significance level 80
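A minimal sketch of the sign test described above, using an exact two-sided binomial p-value on the signs of per-query differences; the example differences and the zero threshold are illustrative assumptions, not data from the lecture.

# Minimal sketch of the sign test on per-query score differences (A - B).
# The example differences and the zero threshold are illustrative only.

from math import comb

def sign_test_p(diffs, threshold=0.0):
    """Exact two-sided binomial p-value on the signs of the differences."""
    plus = sum(1 for d in diffs if d > threshold)
    minus = sum(1 for d in diffs if d < -threshold)
    n = plus + minus                       # ties / tiny differences are ignored
    if n == 0:
        return 1.0
    k = max(plus, minus)
    one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)         # double the tail, capped at 1

diffs = [0.05, 0.12, -0.01, 0.03, 0.08, 0.02, 0.06, -0.04, 0.09, 0.07]
print(sign_test_p(diffs))                  # 8 "+" out of 10 -> p ~ 0.11, not significant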
  • 81. Measures for large-scale systems … web search o  Typical user behavior in web search shows a preference for high precision o  Graded scales of relevance seem more useful than binary è NDCG o  Recall difficult to measure on the web -  Often use precision at top k, such as k = 5, k = 10, … o  … or measures that reward you more for getting rank 1 right than for getting rank 10 right è NDCG o  Use non-relevance-based datasets such as click-through data (query logs) o  A/B testing 81
  • 82. A/B testing o  Test a single new “innovation” o  Have most users use the old system o  Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation o  Evaluate with an “automatic” measure like click-through rate o  Now we can directly see if the innovation does improve retrieval performance (e.g. click-through rate) o  Probably the evaluation methodology that large search engines trust most Sec. 8.6.3 82
  • 83. Bias in where users click (figure: number of clicks received by rank position) Strong position bias, so absolute click rates are unreliable 83
  • 84. Relative vs absolute ratings (figure: a user's click sequence over a result list) Hard to conclude Result1 > Result3; probably can conclude Result3 > Result2 Use pairwise relative ratings instead of individual ratings Assess in terms of conformance with historical pairwise preferences recorded from user clicks 84
  • 85. Comparing two rankings via clicks and the interleaving method (Joachims, 2002) Query: [support vector machines] 85
    System A: Kernel machines, SVM-light, Lucent SVM demo, Royal Holl. SVM, SVM software, SVM tutorial
    System B: Kernel machines, SVMs, Intro to SVMs, Archives of SVM, SVM-light, SVM software
  • 86. Interleave the two rankings and remove duplicates (figure: rankings A and B merged into a single interleaved list, with duplicates removed) 86
  • 87. Count user clicks (figure: the interleaved list with clicked results attributed to their origin ranking: A,B / A / A) Clicks: Ranking A: 3, Ranking B: 1 è System A is better than System B 87
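A simplified, team-draft-style sketch of interleaving and click counting (deterministic drafting order, each result credited to a single side); the balanced interleaving of Joachims (2002) used on these slides randomises which ranking contributes first and can credit a result ranked by both systems to both sides, so this is only an approximation of the idea. The clicked documents in the demo are illustrative.

# Simplified team-draft-style interleaving sketch: the two systems take turns
# (deterministically here) adding their highest-ranked result not yet shown,
# and a click is credited to the system that contributed the clicked result.

def interleave(rank_a, rank_b):
    """Return (interleaved_list, credit) with credit mapping doc -> 'A' or 'B'."""
    interleaved, credit = [], {}
    ia, ib, turn_a = 0, 0, True
    while ia < len(rank_a) or ib < len(rank_b):
        ranking, idx = (rank_a, ia) if turn_a else (rank_b, ib)
        while idx < len(ranking) and ranking[idx] in credit:
            idx += 1                               # skip results already placed
        if idx < len(ranking):
            doc = ranking[idx]
            interleaved.append(doc)
            credit[doc] = "A" if turn_a else "B"
        if turn_a:
            ia = idx + 1
        else:
            ib = idx + 1
        turn_a = not turn_a
    return interleaved, credit

def count_clicks(clicked_docs, credit):
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        wins[credit[doc]] += 1
    return wins

rank_a = ["Kernel machines", "SVM-light", "Lucent SVM demo",
          "Royal Holl. SVM", "SVM software", "SVM tutorial"]
rank_b = ["Kernel machines", "SVMs", "Intro to SVMs",
          "Archives of SVM", "SVM-light", "SVM software"]
merged, credit = interleave(rank_a, rank_b)
print(count_clicks(["Kernel machines", "SVM-light"], credit))   # illustrative clicks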
  • 88. 88 o  Focus on measuring its effectiveness rather than efficiency o  We recall that: - Effectiveness is the ability to make the right classification decision - Efficiency is concerned with time and space requirement Evaluation of classifiers
  • 89. 89 Evaluation of classifiers o  After a classifier is constructed using a training set, the effectiveness is evaluated using a test set o  For each category ci, we calculate the following sets: - TPi: true positives -  FPi: false positives -  TNi: true negatives -  FNi: false negatives
  • 90. 90 True and false positives with respect to a category o  TPi: true positives with respect to category ci - the set of documents that both the classifier and the previous judgments (as recorded in the test set) classify under ci o  FPi: false positives with respect to category ci - the set of documents that the classifier classifies under ci, but the test set indicates that they do not belong to ci
  • 91. 91 True and false negatives with respect to a category o  TNi: true negatives with respect to category ci - both the classifier and the test set agree that the documents in TNi do not belong to ci o  FNi: false negatives with respect to category ci - the classifier does not classify the documents in FNi under ci, but the test set indicates that they should be classified under ci
  • 92. 92 Evaluation measures for classifiers o  Precision with respect to category ci: Pi = TPi / (TPi + FPi) o  Recall with respect to category ci: Ri = TPi / (TPi + FNi) (figure: contingency diagram of TPi, FPi, FNi, TNi, crossing “Classified ci” (what the classifier returns) with “Test class ci” (what it should return))
  • 93. 93 Evaluation measures for classifiers o  For obtaining estimates for precision and recall in the collection as a whole, two different methods may be adopted (see the sketch below): - Micro-averaging: counts of true positives, false positives and false negatives for all categories are first summed up; precision and recall are then calculated using the global values - Macro-averaging: average of precision (recall) over the individual categories
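A sketch of micro- versus macro-averaged precision and recall from per-category (TP, FP, FN) counts; the counts in the example are illustrative.

# Sketch: micro- vs macro-averaged precision/recall from per-category
# (TP, FP, FN) counts; the counts below are illustrative.

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def micro_average(per_category):
    tp = sum(c[0] for c in per_category)           # sum the counts first ...
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    return precision_recall(tp, fp, fn)            # ... then compute P and R

def macro_average(per_category):
    prs = [precision_recall(*c) for c in per_category]
    return (sum(p for p, _ in prs) / len(prs),     # average the per-category values
            sum(r for _, r in prs) / len(prs))

counts = [(90, 10, 10),    # a high-generality category
          (2, 1, 8)]       # a low-generality category (few positives)
print(micro_average(counts))   # dominated by the large category
print(macro_average(counts))   # both categories count equally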
  • 94. 94 Micro- vs macro-averaging o  microaveraging and macroaveraging may give quite different results, if the different categories have very different generality o  e.g. the ability of a classifier to behave well also on categories with low generality (i.e. categories with few positive training instances) will be emphasized by macroaveraging o  choice depends on the application
  • 95. Conclusions … a few words o  Here we solely focused on system-oriented evaluation. We should not forget about user-oriented evaluation o  Here we focused on batch-style evaluation. We should not forget that search is part of a bigger task. o  At the end, it is all about making the users “happy”. We should not forget about long-term engagement. o  Lots of work and research has looked beyond precision and recall, in terms of validations, extensions or alternatives o  Lots of work, such as “significance testing”, so that we can be sure that IR system A is indeed better than IR system B. o  Here we focused on “documents” and text. We should not forget multimedia, mobile, social media, etc., where evaluating effectiveness may mean something a bit different. 95