The document discusses sources of bias in search results from a large newspaper archive and methods for quantifying this bias. It finds significant retrievability bias based on Lorenz curves and Gini coefficients, with many documents never retrieved. Certain document features like date, size and type correlate with lower retrievability scores. Real user queries exhibit more bias than simulated queries. Quantifying and understanding biases can help provide more representative search results.
4. Motivation
• Users want to be able
  • to get a fair overview of the archive’s content
  • to access all (relevant) documents in the archive
• However,
  • data collections are implicitly and explicitly biased,
  • users are biased,
  • and technology induces even more bias(es)
• … which I can deal with if the bias is made explicit.
11. Retrievability Bias
• Bias in search results
• Potential sources are:
  • User interest
  • Search skills of users
  • Users’ willingness to explore results
  • Collection bias (indexed documents)
  • OCR errors
  • Side-effects of ranking algorithm
  • Side-effects of result presentation
12. Research Questions
RQ1: Detecting and quantifying retrievability bias
RQ2: Influence of document features on retrievability bias
RQ3: Representativeness of simulated queries and experimental setup
13. Retrievability
• Introduced by Azzopardi and Vinay [1] in 2008 in a study based on born-digital documents and simulated queries
• The retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries
• Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of the scores
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
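The counting described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name, the `rankings` structure and the toy queries and document ids are all assumptions, and `cutoff` plays the role of K.

```python
from collections import Counter

def retrievability_scores(rankings, cutoff):
    """Count, per document, how many queries return it within the top `cutoff` results.

    `rankings` (hypothetical structure) maps each query string to its
    ranked list of document ids, best match first.
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        # Only the first `cutoff` results of each query contribute to the score.
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores

# Toy rankings for three made-up queries (illustrative data only).
rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam haven": ["d2", "d1", "d4"],
    "koningin": ["d2", "d5", "d1"],
}
scores = retrievability_scores(rankings, cutoff=2)
```

With a cutoff of 2, documents d3 and d4 never enter a result list and keep a score of 0, which is the kind of document that later dominates the inequality measures.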
19. Lorenz Curve & Gini Coefficient
• Introduced by economists to express and visualize inequality in wealth distribution
• Gini coefficient (G):
  • perfect communist (G=0)
  • in-between (G=0.5)
  • perfect tyranny (G=0.8)
• There is no good or bad G.
[Figure: Lorenz curve for n=5. x-axis: % of population (% of documents); y-axis: % of income (% of accumulated r(d)). Curves for the distributions 1, 1, 1, 1, 1; 0, 0, 1, 1, 2; and 0, 0, 0, 0, 1.]
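The formula for G is not preserved on this slide. As a hedged sketch, the following Python uses a common form of the Gini coefficient for values sorted in ascending order, G = sum_i (2i - n - 1) x_i / (n * sum_i x_i), which reproduces the three G values listed above for the n=5 distributions. The function names are assumptions, and the code expects at least one non-zero value.

```python
def lorenz_curve(values):
    """Return cumulative shares of the total for values sorted ascending, starting at 0.0."""
    ordered = sorted(values)
    total = sum(ordered)  # assumed to be non-zero
    points = [0.0]
    running = 0
    for v in ordered:
        running += v
        points.append(running / total)
    return points

def gini(values):
    """Gini coefficient with n (not n - 1) in the denominator, so the maximum is (n - 1) / n."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)  # assumed to be non-zero
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * total)

g_equal = gini([1, 1, 1, 1, 1])
g_between = gini([0, 0, 1, 1, 2])
g_tyranny = gini([0, 0, 0, 0, 1])
curve_between = lorenz_curve([0, 0, 1, 1, 2])
```

Because of the n in the denominator, the most unequal distribution of five values yields 0.8 rather than 1, which matches the "perfect tyranny" value on the slide.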
20. Experimental setup / Parameters
• Digitized collection of Dutch historic newspapers
• View data extracted from user logs
• Real queries, simulated queries
• Standard Information Retrieval models: TFIDF, LM1000, BM25 (using Lemur framework)
• Pre-processing (corpus & queries): Stemming, stopword removal, operator removal
• Cutoff values: c=10, c=100, c=1000
22. Simulated Queries
• Followed a strategy similar to previous studies
• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms
• No filtering for OCR errors
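A rough sketch of this kind of query simulation, assuming that "top" means most frequent and that documents are already tokenized and preprocessed. The function name, the `top_n` parameter and the toy documents are hypothetical; the study itself used 2 million terms of each kind.

```python
from collections import Counter

def simulate_queries(documents, top_n):
    """Build a query set from the `top_n` most frequent single terms and bigrams.

    `documents` is a list of token lists (already stemmed and stopword-filtered).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in documents:
        unigrams.update(tokens)
        # Adjacent token pairs within the same document form the bigrams.
        bigrams.update(zip(tokens, tokens[1:]))
    single = [term for term, _ in unigrams.most_common(top_n)]
    double = [" ".join(pair) for pair, _ in bigrams.most_common(top_n)]
    return single + double

# Toy corpus (illustrative data only).
documents = [
    ["koning", "bezoekt", "amsterdam"],
    ["koning", "bezoekt", "rotterdam"],
    ["koning", "opent", "haven"],
]
queries = simulate_queries(documents, top_n=2)
```

Since no OCR filtering is applied, misrecognized tokens that occur often enough would end up in such a query set as well.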
23. Real Queries
• User logs collected between March and July 2015 on Delpher, the online web service of the National Library of the Netherlands
• Extracted queries and viewed items related to newspaper archive
• Total of 957,239 unique queries
29. Limitations
• The Lorenz curves and Gini values
  • are strongly influenced by non-retrieved documents,
  • can indicate the degree of bias, but they tell us nothing about the type of bias.
• Does the inequality arise from the users’ interest / search behavior? Or from a technological bias towards a particular document feature?
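To illustrate the first limitation, the sketch below compares G for a small set of retrieved documents with G for the same set after adding documents that were never retrieved (score 0). The scores are made up, and the Gini form used, with n in the denominator, is an assumption.

```python
def gini(values):
    """Gini coefficient for non-negative values with a non-zero sum (n in the denominator)."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))

# Hypothetical retrievability scores of four retrieved documents.
retrieved_only = [1, 1, 2, 2]
# The same collection with four never-retrieved documents included.
with_non_retrieved = retrieved_only + [0, 0, 0, 0]

g_retrieved_only = gini(retrieved_only)
g_with_non_retrieved = gini(with_non_retrieved)
```

In this toy case G rises from about 0.17 to about 0.58 purely because zero-score documents are counted, without saying anything about why they were not retrieved.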
30. Retrievability scores: Meaningful?
• Created 4 subsets of documents according to their score and selected a set of target documents from each subset
• Generated queries from selected documents, tailored to retrieve these specific documents
• Performed search tasks and measured ranks of target documents
• Showed that documents with lower score are actually harder to find
[Figure labels: Rarely, Sometimes, Often, Very often]
37. Differences between query sets
• Real queries:
  • Mean length: 2.32 terms
  • Unique terms: 253,637
  • 56 references to persons or locations in top 100 terms
• Simulated queries:
  • Mean length: 1.5 terms
  • Unique terms: 2,028,617
  • 5 references to persons or locations in top 100 terms
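The first two statistics per query set could be computed along these lines. This is a minimal sketch assuming whitespace tokenization of already preprocessed queries; the function name and the example queries are hypothetical.

```python
def query_set_stats(queries):
    """Return (mean query length in terms, number of unique terms) for a list of query strings."""
    term_lists = [query.split() for query in queries]
    total_terms = sum(len(terms) for terms in term_lists)
    unique_terms = {term for terms in term_lists for term in terms}
    return total_terms / len(term_lists), len(unique_terms)

# Three made-up queries (illustrative data only).
mean_length, n_unique = query_set_stats(
    ["koning", "koning amsterdam", "haven rotterdam 1920"]
)
```

Counting references to persons or locations would additionally need a named-entity resource, which is not sketched here.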
38. Actual views
[Figure: histogram of views per document. x-axis: Number of Views; y-axis: Counts, from 1 to 1,000,000]
• Only 2.7M out of 102M documents were viewed by users (G = 0.98)
  • most documents have not been viewed at all
  • many documents only viewed once
  • very few are viewed multiple times
39. Overlap with views
• How many documents were viewed by the users, but not retrieved in our study?
• Many non-retrieved documents
  • were found using facets or operators
  • scored a rank just below the cutoff
• A better representation of the real search engine is needed, taking faceted search and operators into account
[Figure: bar chart of viewed documents, Retrieved vs. Non-Retrieved, at cutoffs c=10, c=100, c=1000]
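The comparison on this slide amounts to a set intersection per cutoff. The sketch below is one way to express it; the function name, the data structures and the document ids are assumptions made for illustration.

```python
def view_overlap(viewed, retrieved_by_cutoff):
    """For each cutoff, return (viewed and retrieved, viewed but not retrieved) document counts.

    `viewed` is the set of document ids seen in the user logs;
    `retrieved_by_cutoff` maps a cutoff c to the set of ids retrieved at that cutoff.
    """
    overlap = {}
    for cutoff, retrieved in retrieved_by_cutoff.items():
        hits = len(viewed & retrieved)
        overlap[cutoff] = (hits, len(viewed) - hits)
    return overlap

# Made-up document ids (illustrative data only).
viewed = {"d1", "d2", "d3", "d4"}
retrieved_by_cutoff = {
    10: {"d1"},
    100: {"d1", "d2", "d9"},
    1000: {"d1", "d2", "d3", "d9"},
}
overlap = view_overlap(viewed, retrieved_by_cutoff)
```

Documents that stay in the second count even at c=1000 are candidates for having been reached through facets or operators rather than through plain keyword search.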
41. Conclusions
• Real and simulated queries differ with regard to
  • composition of query sets
  • number of (unique) terms used
  • use of named entities
• Apart from document length and page confidence, we did not find strong evidence for technical bias
• Using real queries is important for realistic results
• Simulation strategies for queries need to be improved
• Retrievability studies should take faceted search and operators into account
42. We would like to thank the [logo not preserved] for making the newspaper corpus and the (sensitive) user data available to us for research.
Supported by [logo not preserved] travel grant
Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus