Search engines are not “objective” pieces of technology, and bias in Delpher’s search engine may harm user access to certain types of documents in the collection. In the worst case, systematic favoritism towards one document type can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the “retrievability” of all documents in a collection. We explain the ideas underlying the retrievability metric and how we measured it on the KB newspaper collection. We describe and quantify the retrievability bias imposed on the newspaper collection by three commonly used Information Retrieval models, investigating how document features such as length, type, or date of publication influence retrievability.
We also investigate the effectiveness of the retrievability measure itself, with two characteristics that set our experiments apart from previous studies: (1) the newspaper collection contains noise originating from OCR processing and from historical spelling and language use; and (2) rather than the simulated queries used in other studies, we use real user query logs, including click data. We show how simulated queries differ from real user queries regarding term frequency and the prevalence of named entities, and how this affects the results of a retrieval task.
4. Motivation
• Users want to be able
  • to access all (relevant) documents in Delpher
  • to get a fair overview of Delpher’s content
• However,
  • data collections are implicitly biased,
  • users are biased,
  • and technology induces even more bias(es)
… which I can deal with if the bias is made explicit. #toolcrit
Note: Bias is not necessarily a bad thing!
5. Research Questions
RQ1: Is access to the digitized newspaper collection in Delpher influenced by a retrievability bias?
RQ2: Can we correlate features of a document (such as document length, time of publication, type of document, etc.) with its retrievability score?
RQ3: To what extent are retrievability experiments using simulated queries representative of the search behavior of real users?
6. Retrievability
• Introduced by Azzopardi and Vinay [1] in 2008, in a study based on born-digital documents and simulated queries
• Measures the accessibility of all documents in a collection for a given set of queries
• The retrievability score r(d) measures how often a document d is retrieved by a given set of queries
• The Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of r(d) scores

[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
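The cumulative r(d) computation can be sketched as follows. This is a minimal illustration, not the study's implementation: the `toy_search` function and the three toy documents are assumptions standing in for the real retrieval model, and a document counts once for every query that ranks it within the cutoff c.

```python
from collections import defaultdict

def retrievability(queries, search, c=100):
    """r(d): number of queries for which document d ranks within cutoff c.

    Uses the simple cumulative scoring f(k, c) = 1 if k <= c else 0.
    `search` must return a ranked list of document ids for a query.
    """
    r = defaultdict(int)
    for q in queries:
        for rank, doc_id in enumerate(search(q), start=1):
            if rank > c:
                break
            r[doc_id] += 1
    return dict(r)

# Toy stand-in "search engine": ranks documents by term overlap with the query.
docs = {"d1": "amsterdam newspaper archive",
        "d2": "rotterdam harbour news",
        "d3": "amsterdam harbour"}

def toy_search(query):
    terms = set(query.split())
    scored = [(len(terms & set(text.split())), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in
            sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

scores = retrievability(["amsterdam", "harbour news"], toy_search, c=2)
# d3 matches both queries within the cutoff, d1 and d2 one each.
```

In the actual experiments this loop runs over millions of queries and documents, so r(d) is accumulated inside the retrieval system rather than over per-query result lists like this.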
7. Lorenz Curve & Gini Coefficient
[Figure: Lorenz curve — x-axis: % of documents, y-axis: % of accumulated r(d); in the original economic setting: % of population vs. % of income]
• Introduced by economists to visualize inequality in wealth distribution
• The Gini coefficient G ranges between 0 and 1:
  • perfect tyranny, e.g. (0, 0, 0, 0, 1): G = 0.8 (→ 1 for large n)
  • perfect equality (“communism”), e.g. (1, 1, 1, 1, 1): G = 0
  • in between, e.g. (0, 0, 1, 1, 2): G = 0.5
• There is no inherently good or bad G.
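The Gini coefficient over a set of r(d) scores can be computed from the mean absolute difference; a minimal sketch, using the three five-element example distributions shown on the slide (the O(n²) double loop is for clarity only — real collections need the sorted O(n log n) form):

```python
def gini(values):
    """Gini coefficient: G = sum_ij |x_i - x_j| / (2 * n^2 * mean)."""
    n = len(values)
    mu = sum(values) / n
    if mu == 0:
        return 0.0  # all-zero distribution: treat as perfectly equal
    mad = sum(abs(x - y) for x in values for y in values)
    return mad / (2 * n * n * mu)

g_tyranny = gini([0, 0, 0, 0, 1])   # one document receives all accesses -> 0.8
g_equal   = gini([1, 1, 1, 1, 1])   # perfectly equal distribution -> 0.0
g_between = gini([0, 0, 1, 1, 2])   # intermediate inequality -> 0.5
```

Note that with only five values the "tyranny" case yields 0.8 rather than 1: for n values the maximum sample Gini is (n − 1)/n, which approaches 1 as n grows.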
10. Simulated Queries
• Followed a similar strategy as previous studies
• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms
• No filtering for OCR errors
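The query-simulation strategy (most frequent unigrams plus most frequent bigrams) can be sketched as below. The function name and the tiny two-document corpus are illustrative; the real experiment takes the top 2 million of each from the preprocessed corpus.

```python
from collections import Counter

def simulated_queries(corpus, n_unigrams=2, n_bigrams=2):
    """Build simulated queries from the most frequent single terms
    and two-term phrases in the corpus (no OCR-error filtering)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return ([t for t, _ in unigrams.most_common(n_unigrams)]
            + [b for b, _ in bigrams.most_common(n_bigrams)])

queries = simulated_queries(["de krant van de dag", "de dag van de krant"])
```

Because the selection is purely frequency-based, high-frequency OCR garbage ends up in the query set alongside real words — one reason these queries diverge from real user queries later in the deck.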
11. Real Queries
• User logs collected between March and July 2015 on Delpher
• Extracted queries and view data related to the newspaper archive
• Total of 957,239 unique queries
18. Limitations
• The Lorenz curves and Gini values
  • are strongly influenced by 0 values,
  • can indicate the degree of bias, but tell us nothing about the type of bias.
Does the bias arise from the users’ interests and search behavior? Or is it a technological bias towards a particular document feature?
19. Frequencies of r(d) values
• Real queries (top plot):
  • maximum r(d) = 4319
  • tend to retrieve a few documents more often
• Simulated queries (bottom plot):
  • maximum r(d) = 807
  • tend to retrieve a larger number of documents
[Figure: histograms of r(d) frequencies (log-scale counts) for real queries (top) and simulated queries (bottom)]
20. Are r(d) Values Meaningful?
• Created 4 subsets of documents according to their r(d) scores (hardly retrieved / retrieved a few times / often retrieved / very often retrieved) and selected a set of target documents from each subset
• Generated queries from the selected documents, tailored to retrieve these specific documents
• Performed the search tasks and measured the ranks of the target documents
• Showed that documents with lower r(d) scores are actually harder to find
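The validation step above — generating queries from a target document and checking at which rank the document comes back — can be sketched as follows. The function names, the sampling scheme, and the parameter values are assumptions for illustration, not the study's exact procedure.

```python
import random

def known_item_queries(doc_text, n_queries=3, terms_per_query=2, seed=0):
    """Sample query terms from a target document, mimicking the
    validation step: queries tailored to retrieve that document."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    tokens = doc_text.lower().split()
    return [" ".join(rng.sample(tokens, terms_per_query))
            for _ in range(n_queries)]

def rank_of(target_id, ranking):
    """1-based rank of the target document, or None if not retrieved."""
    return ranking.index(target_id) + 1 if target_id in ranking else None

queries = known_item_queries("koninklijke bibliotheek den haag")
```

Averaging `rank_of` over many targets per r(d) subset gives the comparison the slide reports: low-r(d) documents come back at worse ranks even for queries built from their own text.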
25. Newspaper Titles
• Number of articles per title ranges from one to 16,348,557 (mean 82,638, median 127)
• Subset of the 10 most prevalent newspaper titles, with mean r(d)
• The top 3 titles are regional ones

Newspaper Title                                       Mean r(d)
Leeuwarder courant: hoofdblad van Friesland           0.15
Nieuwsblad van het Noorden                            0.14
Limburgsch dagblad                                    0.12
Het vrije volk: democratisch-socialistisch dagblad    0.10
De Tijd: godsdienstig-staatkundig dagblad             0.08
Het Vaderland: staat- en letterkundig nieuwsblad      0.07
Leeuwarder courant                                    0.07
Algemeen Handelsblad                                  0.06
De Telegraaf                                          0.06
Rotterdamsch nieuwsblad                               0.05
26. Document Types
• Hardly any differences for simulated queries
• In real queries, the official notifications stand out with a much higher score

Mean r(d) for BM25, c=100:
Document type     Real    Simulated
Article           0.90    3.89
Advertisement     0.51    3.32
Notification*     4.80    3.22
Caption           0.84    3.06

* Familiebericht (family announcement)
30. Differences between query sets
• Real queries:
  • Mean length: 2.32 terms
  • Unique terms: 253,637
  • 56 references to persons or locations in the top 100 terms
• Simulated queries:
  • Mean length: 1.5 terms
  • Unique terms: 2,028,617
  • 5 references to persons or locations in the top 100 terms
31. Document views
[Figure: histogram of view counts per document (log-scale counts)]
• 2.7M out of 102M documents were viewed by users
• The shape of the frequency distribution is very similar to the r(d) frequency plots
• Most documents were viewed only once; very few were viewed more often
32. Overlap with views
• How many documents were viewed by Delpher users, but not retrieved in our study?
• Many non-retrieved documents
  • were found using facets or operators
  • scored a rank just below the cutoff
• Use a smoother cost function based on the ranking
• Better representation of the real search engine, taking faceted search / operators into account
[Figure: counts of retrieved vs. non-retrieved viewed documents for c=10, c=100, c=1000]
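The "smoother cost function" suggestion can be made concrete by contrasting the hard-cutoff scoring used in the experiments with a rank-discounted weight such as the gravity-style f(k) = 1/k^β known from the retrievability literature. A minimal sketch; the value of β is a free parameter, not one from this study:

```python
def f_cumulative(rank, c=100):
    """Hard cutoff: a document counts fully iff it ranks within c."""
    return 1.0 if rank <= c else 0.0

def f_gravity(rank, beta=1.0):
    """Smooth discount: the contribution decays with rank instead
    of dropping to zero at an arbitrary cutoff."""
    return 1.0 / rank ** beta

# A document at rank 101 is invisible under the c=100 cutoff,
# but still contributes a little under the smooth function.
print(f_cumulative(101))         # 0.0
print(round(f_gravity(101), 4))  # 0.0099
```

This directly addresses the documents observed "just below the cutoff": with the smooth function they accumulate a small but non-zero r(d) instead of being counted as never retrieved.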
34. Parameter Sets for Preprocessing

Parameters            Stemming    Stopwords    Operators
PS1 (as used by [1])  yes         removed      removed
PS2                   no          kept         removed
PS3 (only LM1000)     yes         removed      kept
36. Conclusions
• Real and simulated queries differ with regard to
  • composition of the query sets
  • number of (unique) terms used
  • use of named entities
• Apart from document length and page confidence, we did not find strong evidence for technical bias
• Using real queries is important for realistic results
• Simulation strategies for queries need to be improved