Information Retrieval Models Part I

IR Models
Part I | Foundations
Thomas Roelleke, Queen Mary University of London
Ingo Frommholz, University of Bedfordshire
Autumn School for Information Retrieval and Foraging
Schloss Dagstuhl, September 2014
ASIRF Sponsors:

IR Models
Acknowledgements
The knowledge presented in this tutorial and the Morgan & Claypool book is
the result of many many discussions with colleagues.
People involved in the production and reviewing: Gianna Amati and Djoerd
Hiemstra (the experts), Diane Cerra and Gary Marchionini (Morgan &
Claypool), Ricardo Baeza-Yates, Norbert Fuhr, and Mounia Lalmas.
Thomas' PhD students (who had no choice): Jun Wang, Hengzhi Wu, Fred
Forst, Hany Azzam, Sirvan Yahyaei, Marco Bonzanini, Miguel Martinez-Alvarez.
Many more IR experts including Fabio Crestani, Keith van Rijsbergen, Stephen
Robertson, Fabrizio Sebastiani, Arjen deVries, Tassos Tombros, Hugo Zaragoza,
ChengXiang Zhai.
And non-IR experts Fabrizio Smeraldi, Andreas Kaltenbrunner and Norman
Fenton.
2 / 133

IR Models
Table of Contents
1 Introduction
2 Foundations of IR Models
3 / 133

Introduction
Warming Up
Background: Time-Line of IR Models
Notation

IR Models
Introduction
Warming Up
Information Retrieval Conceptual Model
DD
rel.
judg.
aQ b
a
Q
Q
D
b
r
IR
Q D
D D
Q
D
R
[Fuhr, 1992]
6 / 133

IR Models
Introduction
Warming Up
Vector Space Model, Term Space
Still one of the prominent IR frameworks is the Vector Space
Model (VSM)
A term space is a vector space where each dimension
represents one term in our vocabulary
If we have n terms in our collection, we get an n-dimensional
term or vector space
Each document and each query is represented by a vector in
the term space
7 / 133

IR Models
Introduction
Warming Up
Formal Description
Set of terms in our vocabulary: T = ft1; : : : ; tng
T spans an n-dimensional vector space
Document dj is represented by a vector of document term
weights
Query q is represented by a vector of query term weights
8 / 133

IR Models
Introduction
Warming Up
Document Vector
weights dji 2 R:
Document term weights can be computed, e.g., using tf and
idf (see below)
9 / 133

IR Models
Introduction
Warming Up
Document Vector
weights dji 2 R:
Weight of term
in document
Document term weights can be computed, e.g., using tf and
idf (see below)
10 / 133

IR Models
Introduction
Warming Up
Query Vector
Like documents, a query q is represented by a vector of query
term weights qi 2 R:
~q =
0
BB@
q1
q2
: : :
qn
1
CCA
qi denotes the query term weight of term ti
qi is 0 if the term does not appear in the query.
qi may be set to 1 if the term does appear in the query.
Further query term weights are possible, for example
2 if the term is important
1 if the term is just nice to have"
11 / 133

IR Models
Introduction
Warming Up
Retrieval Function
The retrieval function computes a retrieval status value (RSV)
using a vector similarity measure, e.g. the scalar product:
RSV (dj ; q) = ~dj ~q =
Xn
i=1
dji qi
t
t
1
2
q
d
d
1
2
Ranking of documents according to decreasing RSV
12 / 133

IR Models
Introduction
Warming Up
Example Query
Query: side eects of drugs on memory and cognitive abilities
ti Query ~q ~ d1 ~ d2 ~ d3 ~ d4
side eect 2 1 0.5 1 1
drug 2 1 1 1 1
memory 1 1 0 1 0
cognitive ability 1 0 1 1 0.5
RSV 5 4 6 4.5
Produces the ranking d3 d1 d4 d2
13 / 133

IR Models
Introduction
Warming Up
Term weights: Example Text
In his address to the CBI, Mr Cameron is expected to say:
Scotland does twice as much trade with the rest of the UK than
with the rest of the world put together { trade that helps to support
one million Scottish jobs.Meanwhile, Mr Salmond has set out six
job-creating powers for Scotland that he said were guaranteed with
a Yes vote in the referendum. During their televised BBC debate
on Monday, Mr Salmond had challenged Better Together head
Alistair Darling to name three job-creating powers that were being
oered to the Scottish Parliament by the pro-UK parties in the
event of a No vote.
Source: http://www.bbc.co.uk/news/uk-scotland-scotland-politics-28952197
What are good descriptors for the text? Which are more, which are
less important? Which are informative? Which are good
discriminators?
How can a machine answer these questions?
14 / 133

IR Models
Introduction
Warming Up
Frequencies
The answer is counting
Dierent assumptions:
The more frequent a term appears in a document, the more
suitable it is to describe its content
Location-based count. Think of term positions or locations.
In how many locations of a text do we observe the term?
The term `scotland' appears in 2 out of 138 locations in the
example text
The less documents a term occurs in, the more discriminative
or informative it is
Document-based count. In how many documents do we
observe the term?
Think of stop-words like `the', `a' etc.
Location- and document-based frequencies are the building
blocks of all (probabilistic) models to come
15 / 133

Introduction

IR Models
Introduction
Timeline of IR Models: 50s, 60s and 70s
Zipf and Luhn: distribution of document frequencies;
[Croft and Harper, 1979]: BIR without relevance;
[Robertson and Sparck-Jones, 1976]: BIR;
[Salton, 1971, Salton et al., 1975]: VSM, TF-IDF;
[Rocchio, 1971]: Relevance feedback; [Maron and Kuhns, 1960]:
On Relevance, Probabilistic Indexing, and IR
17 / 133

IR Models
Introduction
Timeline of IR Models: 80s
[Cooper, 1988, Cooper, 1991, Cooper, 1994]: Beyond Boole,
Probability Theory in IR: An Encumbrance;
[Dumais et al., 1988, Deerwester et al., 1990]: Latent semantic
indexing; [van Rijsbergen, 1986, van Rijsbergen, 1989]: P(d ! q);
[Bookstein, 1980, Salton et al., 1983]: Fuzzy, extended Boolean
18 / 133

IR Models
Introduction
[Ponte and Croft, 1998]: LM;
[Brin and Page, 1998, Kleinberg, 1999]: Pagerank and Hits;
[Robertson et al., 1994, Singhal et al., 1996]: Pivoted Document
Length Normalisation; [Wong and Yao, 1995]: P(d ! q);
[Robertson and Walker, 1994, Robertson et al., 1995]: 2-Poisson,
BM25; [Margulis, 1992, Church and Gale, 1995]: Poisson;
[Fuhr, 1992]: Probabilistic Models in IR;
[Turtle and Croft, 1990, Turtle and Croft, 1991]: PIN's;
[Fuhr, 1989]: Models for Probabilistic Indexing
19 / 133

IR Models
Introduction
ICTIR 2009 and ICTIR 2011; [Roelleke and Wang, 2008]: TF-IDF
Uncovered; [Luk, 2008, Robertson, 2005]: Event Spaces;
[Roelleke and Wang, 2006]: Parallel Derivation of Models;
[Fang and Zhai, 2005]: Axiomatic approach; [He and Ounis, 2005]:
TF in BM25 and DFR; [Metzler and Croft, 2004]: LM and
PIN's;[Robertson, 2004]: Understanding IDF;
[Sparck-Jones et al., 2003]: LM and Relevance;
[Croft and Laerty, 2003, Laerty and Zhai, 2003]: LM book;
[Zaragoza et al., 2003]: Bayesian extension to LM;
[Bruza and Song, 2003]: probabilistic dependencies in LM;
[Amati and van Rijsbergen, 2002]: DFR;
[Lavrenko and Croft, 2001]: Relevance-based LM;
[Hiemstra, 2000]: TF-IDF and LM; [Sparck-Jones et al., 2000]:
probabilistic model: status
20 / 133

IR Models
Introduction
Timeline of IR Models: 2010 and Beyond
Models for interactive and dynamic IR (e.g.
iPRP [Fuhr, 2008])
Quantum models
[van Rijsbergen, 2004, Piwowarski et al., 2010]
21 / 133

IR Models
Introduction
Notation
Notation
A tedious start ... but a must-have.
Sets
Locations
Documents
Terms
Probabilities
23 / 133

IR Models
Introduction
Notation
Notation: Sets
Notation description of events, sets, and frequencies
t, d, q, c, r term t, document d, query q, collection c, rele-
vant r
Dc , Dr Dc = fd1; : : :g: set of Documents in collection c;
Dr : relevant documents
Tc , Tr Tc = ft1; : : :g: set of Terms in collection c; Tr :
terms that occur in relevant documents
Lc , Lr Lc = fl1; : : :g; set of Locations in collection c; Lr :
locations in relevant documents
24 / 133

IR Models
Introduction
Notation
Notation: Locations
Notation description of events, sets, and
frequencies
Traditional notation
nL(t; d) number of Locations at which
term t occurs in document d
tf, tfd
NL(d) number of Locations in docu-
ment d (document length)
dl
nL(t; q) number of Locations at which
term t occurs in query q
qtf, tfq
NL(q) number of Locations in query q
(query length)
ql
25 / 133

IR Models
Introduction
Notation
Notation: Locations
frequencies
nL(t; c) number of Locations at which
term t occurs in collection c
TF, cf(t)
NL(c) number of Locations in collec-
tion c
nL(t; r ) number of Locations at which
term t occurs in the set Lr
NL(r ) number of Locations in the set
Lr
26 / 133

IR Models
Introduction
Notation
Notation: Documents
frequencies
nD(t; c) number of Documents in
which term t occurs in the set
Dc of collection c
nt , df(t)
ND(c) number of Documents in the
set Dc of collection c
N
nD(t; r ) number of Documents in
which term t occurs in the set
Dr of relevant documents
rt
ND(r ) number of Documents in the
set Dr of relevant documents
R
27 / 133

IR Models
Introduction
Notation
Notation: Terms
frequencies
nT (d; c) number of Terms in docu-
ment d in collection c
NT (c) number of Terms in collec-
tion c
28 / 133

IR Models
Introduction
Notation
Notation: Average and Pivoted Length
Let u denote a collection associated with a set of documents. For
example: u = c, or u = r , or u = r .
Notation description of events, sets, and frequen-
cies
avgdl(u) average document length: avgdl(u) =
NL(u)=ND(u) (avgdl if collection im-
plicit)
avgdl
pivdl(d; u) pivoted document length: pivdl(d; u) =
NL(d)=avgdl(u) = dl=avgdl(u)
(pivdl(d) if collection implicit)
pivdl
(t; u) average term frequency over all docu-
ments in Du: nL(t; u)=ND(u)
avgtf(t; u) average term frequency over elite docu-
ments in Du: nL(t; u)=nD(t; u)
29 / 133

IR Models
Introduction
Notation
Notation: Location-based Probabilities
Notation Description of Probabili-
ties
PL(tjd) := nL(t;d)
NL(d) Location-based within-
document term probabil-
ity
P(tjd) = tfd
jdj , jdj = dl = NL(d)
PL(tjq) := nL(t;q)
NL(q) Location-based within-
query term probability
P(tjq) = tfq
jqj , jqj = ql = NL(q)
PL(tjc) := nL(t;c)
NL(c) Location-based within-
collection term probabil-
ity
P(tjc) = tfc
jcj , jcj = NL(c)
PL(tjr ) := nL(t;r )
NL(r ) Location-based within-
relevance term probabil-
ity
Event space PL: Locations (LM, TF)
30 / 133

IR Models
Introduction
Notation
Notation: Document-based Probabilities
Notation Description of Probabilities Traditional notation
PD(tjc) := nD(t;c)
ND(c) Document-based within-
collection term probability
P(t) = nt
N , N = ND(c)
PD(tjr ) := nD(t;r )
ND(r ) Document-based within-
relevance term probability
P(tjr ) = rt
R , R = ND(r )
PT (djc) := nT (d;c)
NT (c) Term-based document proba-
bility
Pavg(tjc) := avgtf(t;c)
avgdl(c) probability that t occurs
in document with average
length; avgtf(t; c) avgdl(c)
Event space PD: Documents (BIR, IDF)
31 / 133

IR Models
Introduction
Notation
Toy Example
Notation Value
NL(c) 20
ND(c) 10
avgdl(c) 20/10=2
Notation Value
doc1 doc2 doc3
NL(d) 2 3 3
pivdl(d; c) 2/2 3/2 3/2
Notation Value
sailing boats
nL(t; c) 8 6
nD(t; c) 6 5
PL(tjc) 8/20 6/20
PD(tjc) 6/10 5/10
(t; c) 8/10 6/10
avgtf(t; c) 8/6 6/5
32 / 133

Foundations of IR Models
TF-IDF
PRF: The Probability of Relevance Framework
BIR: Binary Independence Retrieval
Poisson and 2-Poisson
BM25
LM: Language Modelling
PIN's: Probabilistic Inference Networks
Relevance-based Models
Foundations: Summary

TF-IDF

IR Models
TF-IDF
TF-IDF
Still a very popular model
Best known outside IR research, ery intuitive
TF-IDF is not a model; it is just a weighting scheme in the
vector space model
TF-IDF is purely heuristic; it has no probabilistic roots.
But:
TF-IDF and LM are dual models that can be shown to be
derived from the same root.
Simpli

IR Models
TF-IDF
TF Variants: TF(t; d)
TFtotal(t; d) := lftotal(t; d) := nL(t; d) (= tfd )
TFsum(t; d) := lfsum(t; d) :=
nL(t; d)
NL(d)
= PL(tjd)

=
tfd
dl

TFmax(t; d) := lfmax(t; d) :=
nL(t; d)
nL(tmax; d)
TFlog(t; d) := lflog(t; d) := log(1 + nL(t; d)) (= log(1 + tfd ))
TFfrac;K (t; d) := lffrac;K (t; d) :=
nL(t; d)
nL(t; d) + Kd

=
tfd
tfd + Kd

TFBM25;k1;b(t; d) := ::: :=
nL(t; d)
nL(t; d) + k1 (b pivdl(d; c) + (1 b))
36 / 133

IR Models
TF-IDF
TF Variants: Collection-wide TF(t; c)
Analogously to TF(t; d), the next de

nes the variants of
TF(t; c), the collection-wide term frequency.
Not considered any further here.
37 / 133

IR Models
TF-IDF
TFtotal and TFlog
0 50 100 150 200
0 50 100 150 200
nL(t,d)
tftotal
0 50 100 150 200
0 1 2 3 4 5
nL(t,d)
tflog
Bias towards documents with many terms
(e.g. books vs. Twitter tweets)
TFtotal:
too steep
assumes all occurrences are independent
(same impact)
TFlog:
less impact to subsequent occurrences
the base of the logarithm is ranking
invariant, since it is a constant:
TFlog;base(t; d) :=
ln(1 + tfd )
ln(base)
38 / 133

IR Models
TF-IDF
Logarithmic TF: Dependence Assumption
The logarithmic TF assigns less impact to subsequent occurrences
than the total TF does. This aspect becomes clear when
reconsidering that the logarithm is an approximation of the
harmonic sum:
TFlog(t; d) = ln(1 + tfd ) 1 +
1
2
+ : : : +
1
tfd
tfd 0
Note: ln(n + 1) =
R n+1
1
1
x dx.
Whereas: TFtotal(t; d) = 1 + 1 + ::: + 1
The

rst occurrence of a term counts in full, the second
counts 1=2, the third counts 1=3, and so forth.
This gives a particular insight into the type of dependence that
is re
ected by bending the total TF into a saturating curve.
39 / 133

IR Models
TF-IDF
TFsum, TFmax and TFfrac: Graphical Illustration
1.0
0.8 0.6 tfmax: NL(tmax d)=400
0.4 0.2 tfmax: NL(tmax d)=500
0.0 0 50 100 150 200 nL(t,d)
tf
tfsum: NL(d)=200
tfsum: NL(d)=2000
0 50 100 150 200
0.0 0.2 0.4 0.6 0.8 1.0
nL(t,d)
tffrac
tffrac K=1
tffrac K=5
tffrac K=10
tffrac K=20
tffrac K=100
tffrac K=200
tffrac K=1000
40 / 133

IR Models
TF-IDF
TFsum, TFmax and TFfrac: Analysis
Document length normalisation (TFfrac: K may depend on
document length)
Usually TFmax yields higher TF-values than TFsum
Linear TF variants are not really important anymore, since
TFfrac (TFBM25) delivers better and more stable quality
TFfrac yields relatively high TF-values already for small
frequencies, and the curve saturates for large frequencies
The good and stable performance of BM25 indicates that this
non-linear nature is key for achieving good retrieval quality.
41 / 133

IR Models
TF-IDF
Fractional TF: Dependence Assumption
What we refer to as fractional TF, is a ratio:
ratio(x; y) =
x
x + y
=
tfd
tfd + Kd
= TFfrac;K(t; d)
The ratio is related to the harmonic sum of squares.
n
n + 1
1 +
1
22 + : : : +
1
n2 n 0
This approximation is based on the following integral:
Z n+1
1
1
z2 dz =

1
z
n+1
1
= 1
1
n + 1
=
n
n + 1
TFfrac assumes more dependence than TFlog. k-th occurrence of a
term has an impact of 1=k2.
42 / 133

IR Models
TF-IDF
TFBM25
TFBM25 is a special TFfrac used in the BM25 model
K is proportional to the pivoted document length
(pivdl(d; c) = dl=avgdl(c)) and involves adjustment
parameters (k1, b).
The common de

nition is:
KBM25;k1;b(d; c) := k1 (b pivdl(d; c) + (1 b))
For b = 1, K is equal to k1 for average documents, less than
k1 for short documents and greater than k1 for long
documents.
Large b and k1 lead to a strong variation of K with an high
impact on the retrieval score
Documents shorter than the average have an advantage over
documents longer than the average 43 / 133

IR Models
TF-IDF
Inverse Document Frequency IDF
The IDF (inverse document frequency) is the negative
logarithm of the DF (document frequency).
Idea: the less documents a term appears in, the more
discrimiative or 'informative' it is
44 / 133

IR Models
TF-IDF
DF Variants
DF(t; c) is a quanti

cation of the document frequency, df(t; c).
The main variants are:
df(t; c) := dftotal(t; c) := nD(t; c)
dfsum(t; c) :=
nD(t; c)
ND(c)
= PD(tjc)

=
df(t; c)
ND(c)

dfsum,smooth(t; c) :=
nD(t; c) + 0:5
ND(c) + 1
dfBIR(t; c) :=
nD(t; c)
ND(c) nD(t; c)
dfBIR,smooth(t; c) :=
nD(t; c) + 0:5
ND(c) nD(t; c) + 0:5
45 / 133

IR Models
TF-IDF
IDF Variants
IDF(t; c) is the negative logarithm of a DF quanti

cation. The
main variants are:
idftotal(t; c) := log dftotal(t; c)
idf(t; c) := idfsum(t; c) := log dfsum(t; c) = log PD(tjc)
idfsum,smooth(t; c) := log dfsum,smooth(t; c)
idfBIR(t; c) := log dfBIR(t; c)
idfBIR,smooth(t; c) := log dfBIR,smooth(t; c)
IDF is high for rare terms and low for frequent terms.
46 / 133

IR Models
TF-IDF
Burstiness
A term is bursty is it occurs often in the documents in which
it occurs
Burstiness is measured by the average term frequency in the
elite set:
avgtf(t; c) =
nL(t; c)
nD(t; c)
Intuition (relevant vs. non-relevant documents):
A good term is rare (not frequent, high IDF) and solitude (not
bursty, low avgtf) in all documents (all non-relevant
documents)
Among relevant documents, a good term is frequent (low IDF,
appears in many relevant documents) and bursty (high avgtf)
47 / 133

IR Models
TF-IDF
IDF and Burstiness
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
0 0.2 0.4 0.6 0.8 1
PD(t|c)
idf(t,c)
l(t,c) = 2
l(t,c) = 1
BURSTY
FREQUENT
l(t,c) = 4
10
8
6
4
2
BURSTY
RARE
avgtf(t,c)
SOLITUDE
RARE
SOLITUDE
FREQUENT
0.1 0.2 0.3 0.4 0.5 P (t|c)
BIR
Example: nD(t1; c) = nD(t2; c) = 1; 000. Same IDF.
nL(t1; d) = 1 for 1,000 documents, whereas
nL(t2; d) = 1 for 999 documents, and nL(t2) = 1; 001 for one doc.
nL(t1; c) = 1; 000 and nL(t2; c) = 2; 000.
avgtf(t1; c) = 1 and avgtf(t2; c) = 2.
(t; c) = avgtf(t; c) PD(tjc)
48 / 133

IR Models
TF-IDF
TF-IDF Term Weight
De

nition (TF-IDF term weight wTF-IDF)
wTF-IDF(t; d; q; c) := TF(t; d) TF(t; q) IDF(t; c)
49 / 133

IR Models
TF-IDF
TF-IDF RSV
De

nition (TF-IDF retrieval status value RSVTF-IDF)
RSVTF-IDF(d; q; c) :=
X
t
wTF-IDF(t; d; q; c)
RSVTF-IDF(d; q; c) =
X
t
TF(t; d) TF(t; q) IDF(t; c)
50 / 133

IR Models
TF-IDF
Probabilistic IDF: Probability of Being Informative
IDF so far is not probabilistic
In probabilistic scenarios, a normalised IDF value such as
0
idf(t; c)
maxidf(c)
1
can be useful
maxidf(c) := log 1
ND(c) = log(ND(c)) is
8 9 10 11 12
the maximal value of idf(t; c) (when a term
occurs only in 1 document) 0 50000 100000 150000 200000
Nd(c)
maxidf
The normalisation does not aect the ranking
51 / 133

IR Models
TF-IDF
Probability that term t is informative
A probabilistic semantics of a max-normalised IDF can be achieved
by introducing an informativeness-based probability,
[Roelleke, 2003], as opposed to the normal notion of
occurrence-based probability, and we denote the probability as
P(t informsjc), to contrast it from the usual
P(tjc) := P(t occursjc) = nD(t;c)
ND(c) .
52 / 133

PRF: The Probability of Relevance
Framework

IR Models
Relevance is at the core of any information retrieval model
With r denoting relevance, d a document and q a query, the
probability that a document is relevant is
P(r jd; q)
54 / 133

IR Models
Probability Ranking Principle
[Robertson, 1977],The Probability Ranking Principle (PRP) in
IR, describes the PRP as a framework to discuss formallywhat is
a good ranking? [Robertson, 1977] quotes Cooper's formal
statement of the PRP:
If a reference retrieval system's response to each request
is a ranking of the documents in the collections in order
of decreasing probability of usefulness to the user who
submitted the request, ..., then the overall eectiveness
of the system ... will be the best that is obtainable on
the basis of that data.
Formally, we can capture the principle as follows. Let A and B be
rankings. Then, a ranking A is better than a ranking B if at every
rank, the probability of satisfaction in A is higher than for B, i.e.:
8rank : P(satisfactoryjrank; A) P(satisfactoryjrank;B) 55 / 133

IR Models
PRF: Illustration
Example (Probability of relevance)
Let three users u1; u2; u3 have judged document-query pairs.
User Doc Query Judgement R
u1 d1 q1 r
u1 d2 q1 r
u1 d3 q1 r
u2 d1 q1 r
u2 d2 q1 r
u2 d3 q1 r
u3 d1 q1 r
u3 d2 q1 r
u4 d1 q1 r
u4 d2 q1 r
PU(r jd1; q1) = 3=4, and PU(r jd2; q1) = 1=4, and PU(r jd3; q1) = 0=2.
Subscript in PU: event space, a set of users.
Total probability:
PU(r jd; q) =
X
u2U
P(r jd; q; u) P(u)
56 / 133

IR Models
Bayes Theorem
Relevance judgements can be incomplete
In any case, for a new query we often do not have judgements
The probability of relevance is estimated via Bayes'
Theorem:
P(r jd; q) =
P(d; qjr ) P(r )
P(d; q)
=
P(d; q; r )
P(d; q)
Decision whether or not to retrieve a document is based on
the so-calledBayesian decision rule:
retrieve document d, if the probability of relevance is greater
than the probability of non-relevance:
retrieve d for q if P(r jd; q) P(r jd; q)
57 / 133

IR Models
Probabilistic Odds, Rank Equivalence
We can express the probabilistic odds of relevance
O(r jd; q) =
P(r jd; q)
P(r jd; q)
=
P(d; q; r )
P(d; q; r )
=
P(d; qjr )
P(d; qjr )

P(r )
P(r )
Since P(r )=P(r ) is a constant, the following rank equivalence
holds:
O(r jd; q) rank =
P(d; qjr )
P(d; qjr )
Often we don't need the exact probability, but an
easier-to-compute value that is rank equivalent (i.e. it preserves the
ranking)
58 / 133

IR Models
Probabilistic Odds
The document-pair probabilities can be decomposed in two ways:
P(d; qjr )
P(djq; r ) P(qjr )
=
P(d; qjr )
P(djq; r ) P(qjr )
() BIR/Poisson/BM25)
=
P(qjd; r ) P(djr )
P(qjd; r ) P(djr )
() LM?)
The equation where d depends on q, is the basis of BIR,
Poisson and BM25...document likelihood P(djq; r )
The equation where q depends on d has been related to LM,
P(qjd), [Laerty and Zhai, 2003], Probabilistic Relevance
Models Based on Document and Query Generation
This relationship and the assumptions required to establish it,
are controversial [Luk, 2008]
59 / 133

IR Models
Documents as Feature Vectors
The next step represents document d as a
vector ~d = (f1; : : : ; fn) in a space of features ~f
i :
P(djq; r ) = P(~djq; r )
A feature could be, for example, the frequency of a word
(term), the document length, document creation time, time of
last update, document owner, number of in-links, or number
of out-links.
See also the Vector Space Model in the Introduction { here
we used term weights as features
Assumptions based on features:
Feature independence assumption
Non-query term assumption
Term frequency split
60 / 133

IR Models
Feature Independence Assumption
The features (terms) are independent events:
P(~djq; r )
Y
i
P(fi jq; r )
Weaker assumption for the fraction of feature probabilities
(linked dependence):
P(djq; r )
P(djq; r )

Y
i
P(fi jq; r )
P(fi jq; r )
Here it is not required to distinguish between these two
assumptions
P(fi jq; r ) and P(fi jq; r ) may be estimated for instance by
means of relevance judgements
61 / 133

IR Models
Non-Query Term Assumption
Non-query terms can be ignored for retrieval (ranking)
This reduces the number of features/terms/dimensions to
consider when computing probabilities
For non-query terms, the feature probability is the same in relevant
documents and non-relevant documents.
for all non-query terms:
P(fi jq; r )
P(fi jq; r )
= 1
Then
62 / 133

IR Models
Term Frequency Split
The product over query terms is split into two parts
First part captures the fi 0 features, i.e. the document terms
Second part captures the fi = 0 features, i.e. the
non-document terms
63 / 133

IR Models
The BIR instantiation of the PRF assumes the vector
components to be binary term features,
i.e. ~d = (x1; x2; : : : ; xn), where xi 2 f0; 1g
Term occurrences are represented in a binary feature vector ~d
in the term space
The event xt = 1 is expressed as t, and xt = 0 as t
65 / 133

IR Models
BIR Term Weight
We save the derivation ... after few steps from O(r jd; q) ...
Event Space: d and q are binary vectors.
De

nition (BIR term weight wBIR)
wBIR(t; r ; r ) := log

PD(tjr )
PD(tjr )

PD(t
jr )
PD(t
jr )

Simpli

ed form (referred to as F1) considering term presence only:
wBIR;F1(t; r ; c) := log
PD(tjr )
PD(tjc)
66 / 133

IR Models
BIR RSV
De

nition (BIR retrieval status value RSVBIR)
RSVBIR(d; q; r ; r ) :=
X
t2dq
wBIR(t; r ; r )
RSVBIR(d; q; r ; r ) =
X
t2dq
log
P(tjr ) P(t
jr )
P(tjr ) P(t
jr )
67 / 133

IR Models
BIR Estimations
Relevance judgements for 20 documents (example
from [Fuhr, 1992]):
di 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
x2 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
R r r r r r r r r r r r r r r r r r r r r
ND(r ) = 12;ND(r ) = 8; P(t1jr ) = P(x1 = 1jr ) = 8=12 = 2=3
P(t1jr ) = 2=8 = 1=4; P( t1jr ) = 4=12 = 1=3; P( t1jr ) = 6=8 = 3=4
Dito for t2.
68 / 133

IR Models
Missing Relevance Information I
Estimation of non-relevant documents:
Take collection-wide term probability as approximation (r c)
P(tjr ) P(tjc)
Use the set of all documents minus the set of relevant
documents (r = cnr )
P(tjr )
nD(t; c) nD(t; r )
ND(c) ND(r )
69 / 133

IR Models
Missing Relevance Information II
Empty set problem: For ND(r ) = ; (no relevant documents),
P(tjr ) is not de

ned. Smoothing deals with this situation.
Add the query to the set of relevant documents and the
collection:
P(tjr ) =
nd (t; r ) + 1
ND(r ) + 1
Other forms of smoothing, e.g.
P(tjr ) =
nd (t; r ) + 0:5
ND(r ) + 1
This variant may be justi

ed by Laplace's law of succession.
70 / 133

IR Models
RSJ Term Weight
Recall wBIR;F1(t; r ; c) := log PD(tjr )
PD(tjc)
De

nition (RSJ term weight wRSJ)
The RSJ term weight is a smooth BIR term weight.
The probability estimation is as follows:
P(tjr ) := (nD(t; r ) + 0:5)=(ND(r ) + 1);
P(tjr ) := ((nD(t; c)+1)(nD(t; r )+0:5))=((ND(c)+2)(ND(r )+1)).
wRSJ;F4(t; r ; r ; c) :=
log

(nd (t; r ) + 0:5)=(ND(r ) nd (t; r ) + 0:5)
(nD(t; c)nD(t; r )+0:5)=((ND(c)ND(r )) (nD(t; c)nD(t; r )) + 0:5)

71 / 133

IR Models
Poisson Model I
Less known model; however, there are good reasons to look at this
model
Demystify the Poisson probability { some research students
resign when hearingPoisson ;-)
Next to the BIR model the natural instantiation of a PRF
model; the BIR model is a special case of the Poisson model
The 2-Poisson probability is arguably the foundation of the
BM25-TF quanti

cation [Robertson and Walker, 1994],Some
Simple Eective Approximations to the 2-Poisson Model.
73 / 133

IR Models
Poisson Model II
The Poisson probability is a model of randomness. Divergence
from randomness (DFR), which is based on the probability
P(t 2 djcollection) = P(tfd 0jcollection). The probability
P(tfd jcollection) can be estimated by a Poisson probability.
The Poisson parameter (t; c) = nL(t; c)=ND(c), i.e. the
average number of term occurrences, relates Document-based
and Location-based probabilities.
avgtf(t; c) PD(tjc) = (t; c) = avgdl(c) PL(tjc)
We refer to this relationship as Poisson Bridge since the
average term frequency is the parameter of the Poisson
probability.
74 / 133

IR Models
Poisson Model III
The Poisson model yields a foundation of TF-IDF (see Part II)
The Poisson bridge helps to relate TF-IDF and LM (see
Part II)
75 / 133

IR Models
Poisson Distribution
Let t be an event that occurs in average t times. The Poisson
probability is:
PPoisson;t (k) :=
kt
k!
et
76 / 133

IR Models
Poisson Example
The probability that k = 4 sunny days
occur in a week, given the average
= p n = 180=360 7 = 3:5 sunny
days per week, is:
0.00 0.05 0.10 0.15 0.20
k
P3.5(k)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
PPoisson;=3:5(k = 4) =
(3:5)4
4!
e3:5 0:1888
77 / 133

IR Models
Poisson PRF: Basic Idea
The Poisson model could be used to estimate the document
probability P(djq; r ) as the product of the probabilities of the
within-document term frequencies kt :
P(djq; r ) = P(~djq; r ) =
Y
t
P(kt jq; r )
kt := tfd := nL(t; d)
~d is a vector of term frequencies (cf. BIR: binary vector of
occurrence/non-occurrence)
78 / 133

IR Models
Poisson PRF: Odds
Rank-equivalence, non-query term assumption and probabilistic
odds lead us to:
O(r jd; q) rank =
Y
t2q
P(kt jr )
P(kt jr )
; kt := tfd := nL(t; d)
Splitting the product into document and non-document terms
yields:
O(r jd; q) rank =
Y
t2dq
P(kt jr )
P(kt jr )

Y
t2qnd
P(0jr )
P(0jr )
79 / 133

IR Models
Poisson PRF: Meaning of Poisson Probability
For the set r (r ) and a document d of length dl, we look at:
How many times would we expect to see a term t in d?
How many times do we actually observe t in d (= kt)?
P(kt jr ) is highest if our expectation is met by our observation.
(t; d; r ) = dl PL(tjr ) is a number of expected occurrences in a
document with length dl (how many times will we draw t if we had
dl trials?)
P(kt jr ) = PPoisson;(t;d;r )(kt ): probability to observe kt occurrences
of term t in dl trials
80 / 133

IR Models
Poisson PRF: Example
dl = 5; PL(tjr ) = 5=10 = 1=2
(t; d; r ) = 5 1=2 = 2:5
P(2jr ) = PPoisson;2:5(2) = 0:2565156
0.00 0.05 0.10 0.15 0.20 0.25
k
P2.5(k)
0 1 2 3 4 5 6 7 8 9 10
81 / 133

IR Models
Poisson Term Weight
Event Space: d and q are frequency vectors.
De

nition (Poisson term weight wPoisson)
wPoisson(t; d; q; r ; r ) := TF(t; d) log
(t; d; r )
(t; d; r )
wPoisson(t; d; q; r ; r ) = TF(t; d) log
PL(tjr )
PL(tjr )
(t; d; r ) = dl PL(tjr )
TF(t; d) = kt (smarter TF quanti

IR Models
Poisson RSV
De

nition (Poisson retrieval status value RSVPoisson)
RSVPoisson(d; q; r ; r ) :=
2
4
X
t2dq
wPoisson(t; d; q; r ; r )
3
5+
len normPoisson
RSVPoisson(d; q; r ; r ) =
2
4
X
t2dq
TF(t; d) log
PL(tjr )
PL(tjr )
3
5+
dl
X
t
(PL(tjr ) PL(tjr ))
83 / 133

IR Models
2-Poisson
Viewed as a motivation for the BM25-TF quanti

cation,
[Robertson and Walker, 1994], Simple Approximations to the
2-Poisson Model. In an exchange with Stephen Robertson, he
explained:
The investigation into the 2-Poisson probability
motivated the BM25-TF quanti

cation.Regarding the
combination of TF and RSJ weight in BM25, TF can be
viewed as a factor to re
ect the uncertainty about
whether the RSJ weight wRSJ is correct; for terms with a
relatively high within-document TF, the weight is correct;
for terms with a relatively low within-document TF, there
is uncertainty about the correctness. In other words, the
TF factor can be viewed as a weight to adjust the impact
of the RSJ weight.
84 / 133

IR Models
2-Poisson Example: How many cars to expect? I
How many cars are expected on a given commuter car park?
Approach 1: In average, there are 700 cars per week. The
daily average is: = 700=7 = 100 cars/day.
Then, P=100(k) is the probability that there are k cars
wanting to park on a given day.
Estimation is less accurate than an estimation based on a
2-dimensional model { Mo-Fr are the busy days, and on
week-ends, the car park is nearly empty.
This means that a distribution such as (130, 130, 130, 130,
130, 25, 25) is more likely than 100 each day.
85 / 133

IR Models
2-Poisson Example: How many cars to expect? II
Approach 2: In a more detailed analysis, we observe 650 cars
Mon-Fri (work days) and 50 cars Sat-Sun (week-end days).
The averages are: 1 = 650=5 = 130 cars/work-day,
2 = 50=2 = 25 cars/we-day.
Then, P1=5=7;1=130;2=2=7;2=25(k) is the 2-dimensional
Poisson probability that there are k cars looking for a car
park.
86 / 133

IR Models
2-Poisson Example: How many cars to expect? III
Main idea of the 2-Poisson probability: combine (interpolate,
mix) two Poisson probabilities
P2-Poisson;1;2;(kt) :=
kt
1
kt !
e1 + (1 )
kt
2
kt !
e2
Such a mixture model is also used with LM (later)
1 could be over all documents whereas 2 could be over an
elite set (e.g. documents that contain at least one query term)
87 / 133

BM25

IR Models
BM25
BM25
One of the most prominent IR models
The ingredients have already been prepared:
TFBM25;K (t; d)
wRSJ;F4(t; r ; r ; c)
Wikipedia 2014 formulation: Given a query Q containing
keywords q1; : : : ; qn:
score(D;Q) =
Xn
i=1
IDF(qi )
f (qi ;D) (k1 + 1)
f (qi ;D) + k1 (1 b + b jDj
avgdl )
with f (qi ;D) = nL(qi ;D) (qi 's term frequency)
Here:
Ignore (k1 + 1) (ranking invariant!)
Use wRSJ;F4(t; r ; r ; c). If relevance information is missing:
wRSJ;F4(t; r ; r ; c) IDF(t)
89 / 133

IR Models
BM25
BM25 Term Weight
De

nition (BM25 term weight wBM25)
wBM25;k1;b;k3(t; d; q; r ; r ) :=
X
t2dq
TFBM25;k1;b(t; d) TFBM25;k3(t; q) wRSJ(t; r ; r )
TFBM25;k1;b(t; d) :=
tfd
tfd + k1 (b pivdl(d) + (1 b))
90 / 133

IR Models
BM25
BM25 RSV
De

nition (BM25 retrieval status value RSVBM25)
RSVBM25;k1;b;k2;k3(d; q; r ; r ; c) := 2
4
X
t2dq
wBM25;k1;b;k3(t; d; q; r ; r )
3
5 + len normBM25;k2
Additional length normalisation
len normBM25;k2(d; q; c) := k2 ql
avgdl(c) dl
avgdl(c) + dl
(ql is query length). Suppresses long documents (negative length
norm).
91 / 133

IR Models
BM25
BM25: Summary
We have discussed the foundations of BM25, an instance of the
probability relevance framework (PRF).
TF quanti

cation TFBM25 using TFfrac and the pivoted
document length
TFBM25 can be related to probability theory through
semi-subsumed event occurrences (out of scope of this talk)
RSJ term weight wRSJ as smooth variant of the BIR term
weight (0.5-smoothing can be explained through Laplace's
law of succession)
92 / 133

IR Models
BM25
Document- and Query-Likelihood Models
We are now leaving the world of document-likelihood models and
move towards LM, the query-likelihood model.
P(d; qjr )
P(d; qjr )
=
P(djq; r ) P(qjr )
P(djq; r ) P(qjr )
() BIR/Poisson/BM25)
=
P(qjd; r ) P(djr )
P(qjd; r ) P(djr )
() LM?)
93 / 133

IR Models
Language Modelling (LM)
Popular retrieval model since the late 90s
[Ponte and Croft, 1998, Hiemstra, 2000,
Croft and Laerty, 2003]
Compute probability P(qjd) that a document generates a
query
Q
q is a conjunction of term events: P(qjd) /
P(tjd)
Zero-probability problem: terms t that don't appear in d lead
to P(tjd) = 0 and hence P(qjd) = 0
Mix the within-document term probability P(tjd) and the
collection-wide term probability P(tjc)
95 / 133

IR Models
Probability Mixture
Let three events x; y; z, and two conditional probabilities P(zjx)
and P(zjy) be given.
Then, P(zjx; y) can be estimated as a linear combination/mixture
of P(zjx) and P(zjy).
P(zjx; y) x P(zjx) + (1 x ) P(zjy)
Here, 0 x 1 is the mixture parameter.
The mixture parameters can be constant (Jelinek-Mercer mixture),
or can be set proportional to the total probabilities.
96 / 133

IR Models
Probability Mixture Example
Let P(sunny;warm; rainy; dry; windyjglasgow) describe the
probability that a day in Glasgow is sunny, the next day is
warm, the next rainy, and so forth
If for one event (e.g. sunny), the probability were zero, then
the probability of the conjunction (product) is zero. A mixture
solves the problem.
For example, mix P(xjglasgow) with P(xjuk) where
P(xjuk) 0 for each event x.
Then, in a week in winter, when P(sunnyjglasgow) = 0, and
for the whole of the UK, the weather oce reports 2 of 7 days
as sunny, the mixed probability is:
P(sunnyjglasgow; uk) =
0
7
+ (1 )
2
7
97 / 133

IR Models
LM1 Term Weight
Event Space: d and q are sequences of terms.
De

nition (LM1 term weight wLM1)
P(tjd; c) := d P(tjd) + (1 d ) P(tjc)
wLM1;d (t; d; q; c) := TF(t; q) log
=P(tjd;c) z }| {
(d P(tjd) + (1 d ) P(tjc))
P(tjd): foreground probability
P(tjc): background probability
98 / 133

IR Models
Language Modelling Independence Assumption
We assume terms are independent, meaning the product over the
term probabilities P(tjd) is equal to P(qjd):
Probability that mixture of background and foreground
probabilities generates query as sequence of terms:
t IN q: sequence of terms (e.g.
q = (sailing, boat, sailing))
t 2 q: set
of terms; TF(sailing; q) = 2
Note: log
P(tjd; c)TF(t;q)

= TF(t; q) log (P(tjd; c))
99 / 133

IR Models
LM1 RSV
De

nition (LM1 retrieval status value RSVLM1)
RSVLM1;d (d; q; c) :=
X
t2q
wLM1;d (t; d; q; c)
100 / 133

IR Models
JM-LM Term Weight
For constant , the score can be divided by
Q
t IN q(1 ). This
leads to the following equation, [Hiemstra, 2000]:
P(qjd; c)
P(qjc)
Q
t IN q(1 )
=
Y
t2dq

1 +

1

P(tjd)
P(tjc)
TF(t;q)
De

nition (JM-LM (Jelinek-Mercer) term weight wJM-LM)
wJM-LM;(t; d; q; c) := TF(t; q) log

1 +

1

P(tjd)
P(tjc)

101 / 133

IR Models
JM-LM RSV
De

nition (JM-LM retrieval status value RSVJM-LM)
RSVJM-LM;(d; q; c) :=
X
t2dq
wJM-LM;(t; d; q; c)
RSVJM-LM;(d; q; c) :=
X
t2dq
TF(t; q) log

1 +

1

P(tjd)
P(tjc)

We only need to look at terms that appear in both document and
query.
102 / 133

IR Models
Dirichlet-LM Term Weight
Document-dependent mixture parameter: D = dl
dl+
De

nition (Dirich-LM term weight wDirich-LM)
wDirich-LM;(t; d; q; c) := TF(t; q) log

+jdj
+
jdj
jdj+

P(tjd)
P(tjc)

103 / 133

IR Models
Dirich-LM RSV
De

Information Retrieval Models Part I

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Information Retrieval Models Part I

Similar to Information Retrieval Models Part I (20)

Recently uploaded

Recently uploaded (20)

Information Retrieval Models Part I