Dynamic aspects of Information Retrieval (IR), including changes found in data, users and systems, are increasingly being utilized in search engines and information filtering systems. Examples include large datasets containing sequential data capturing document dynamics and modern IR systems observing user dynamics through interactivity. Existing IR techniques are limited in their ability to optimize over changes, learn with minimal computational footprint and be responsive and adaptive.
The objective of this tutorial is to provide a comprehensive and up-to-date introduction to Dynamic Information Retrieval Modeling. Dynamic IR Modeling is the statistical modeling of IR systems that can adapt to change. It is a natural follow-up to previous statistical IR modeling tutorials, with a fresh look at state-of-the-art dynamic retrieval models and their applications, including session search and online advertising. The tutorial covers techniques ranging from classic relevance feedback to the latest applications of partially observable Markov decision processes (POMDPs), and presents to fellow researchers and practitioners a handful of useful algorithms and tools for solving IR problems that incorporate dynamics.
http://www.dynamic-ir-modeling.org/
A newer version of this tutorial presented at WSDM 2015 can be found here http://www.slideshare.net/marcCsloan/dynamic-information-retrieval-tutorial-wsdm-2015
This version (SIGIR 2014) places a greater emphasis on the underlying theory and includes a guest lecture on evaluation by Dr Emine Yilmaz. The newer version (WSDM 2015) presents a wider range of applications of DIR in state-of-the-art research and includes a guest lecture on evaluation by Prof Charles Clarke.
@inproceedings{Yang:2014:DIR:2600428.2602297,
author = {Yang, Hui and Sloan, Marc and Wang, Jun},
title = {Dynamic Information Retrieval Modeling},
booktitle = {Proceedings of the 37th International ACM SIGIR Conference on Research \& Development in Information Retrieval},
series = {SIGIR '14},
year = {2014},
isbn = {978-1-4503-2257-7},
location = {Gold Coast, Queensland, Australia},
pages = {1290--1290},
numpages = {1},
url = {http://doi.acm.org/10.1145/2600428.2602297},
doi = {10.1145/2600428.2602297},
acmid = {2602297},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {dynamic information retrieval modeling, probabilistic relevance model, reinforcement learning},
}
4. Dynamic Information Retrieval
[Diagram: a user with an information need explores the document space; observed documents feed back to the user]
Devise a strategy for helping the user explore the information space in order to learn which documents are relevant and which aren't, and satisfy their information need.
5. Evolving IR
• Paradigm shifts in IR as new models emerge
  • e.g. VSM → BM25 → Language Model
  • Different ways of defining the relationship between query and document
• Static → Interactive → Dynamic
  • Evolution in modeling user interaction with the search engine
6. Outline
• Introduction
  • Static IR
  • Interactive IR
  • Dynamic IR
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
7. Conceptual Model – Static IR
[Diagram: Static IR, Interactive IR and Dynamic IR as nested conceptual models]
• No feedback
8. Characteristics of Static IR
• Does not learn directly from the user
• Parameters updated periodically
12. Outline
• Introduction
  • Static IR
  • Interactive IR
  • Dynamic IR
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
13. Conceptual Model – Interactive IR
[Diagram: Static IR, Interactive IR and Dynamic IR]
• Exploit feedback
15. Interactive Recommender Systems
Learn the user's taste interactively! At the same time, provide good recommendations!
16. Example – Multi Page Search
[Screenshot: an ambiguous query]
17. Example – Multi Page Search
[Screenshot: results for the topic "car"]
18. Example – Multi Page Search
[Screenshot: results for the topic "animal"]
19. Example – Interactive Search
[Screenshot: the user clicks on a "car" webpage]
20. Example – Interactive Search
[Screenshot: the user clicks on "Next Page"]
21. Example – Interactive Search
[Screenshot: page 2 results show cars]
22. Example – Interactive Search
[Screenshot: the user clicks on an "animal" webpage]
23. Example – Interactive Search
[Screenshot: page 2 results show animals]
24. Example – Dynamic Search
[Screenshot: results for the topic "guitar"]
25. Example – Dynamic Search
[Screenshot: a diversified page 1 covering the topics cars, animals and guitars]
26. Toy Example
• Multi-page search scenario
• The user image-searches for "jaguar"
• Rank two of the four results over two pages:
[Images: four candidate results with relevance probabilities p = 0.5, p = 0.51, p = 0.9, p = 0.49]
27. Toy Example – Static Ranking
• Ranked according to the PRP (Probability Ranking Principle)
Page 1: 1. p = 0.9, 2. p = 0.51
Page 2: 1. p = 0.5, 2. p = 0.49
28. Toy Example – Relevance Feedback
• Interactive search
• Improve the 2nd page based on feedback from the 1st page
• Use clicks as relevance feedback
• Rocchio¹ algorithm on the terms in the image webpages (sketched in Python below):
  q' = α q + (β / |D_r|) Σ_{d_j ∈ D_r} d_j − (γ / |D_nr|) Σ_{d_j ∈ D_nr} d_j
• The new query is closer to the relevant documents and further from the non-relevant documents
¹Rocchio, J. J., '71; Baeza-Yates & Ribeiro-Neto, '99
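A minimal Python sketch of the Rocchio update, assuming term-weight vectors stored as dicts; the parameter values and toy click data are illustrative, not from the tutorial.

from collections import defaultdict

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the
    centroid of the relevant documents and away from the non-relevant ones."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

# Toy usage: a click on the "car" image page pulls the query toward car terms.
q = {"jaguar": 1.0}
clicked = [{"jaguar": 0.8, "car": 0.9, "engine": 0.4}]
skipped = [{"jaguar": 0.7, "animal": 0.9, "wildlife": 0.5}]
print(rocchio(q, clicked, skipped))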
29. Toy Example – Relevance Feedback
• Ranked according to PRP and Rocchio
[Diagram: two-page ranking, page 1 (p = 0.9, p = 0.51) and page 2 (p = 0.5, p = 0.49); * marks a click on page 1]
30. Toy Example – Relevance Feedback
• No click when searching for animals
[Diagram: page 1 ranked p = 0.9, p = 0.51; the page 2 ranking is left undetermined (?, ?)]
31. Toy Example – Value Function
• Optimize both pages using dynamic IR
• Bellman equation for the value function; simplified example:
  V_t(R_t, Σ_t) = max_{s_t} [ s_t R_t + E( V_{t+1}(R_{t+1}, Σ_{t+1}) | C_t ) ]
• R_t, Σ_t = relevance and covariance of the documents for page t
• C_t = clicks on page t
• V_t = "value" of the ranking on page t
• Maximize value over all pages based on estimating feedback
32. Toy Example – Covariance
• The covariance matrix represents the similarity between images:
  Σ = [ 1    0.8  0.1  0
        0.8  1    0.1  0
        0.1  0.1  1    0.95
        0    0    0.95 1 ]
33. Toy Example – Myopic Value
• For the myopic ranking, V_2 = 16.380
[Diagram: page 1 ranking]
34. Toy Example – Myopic Ranking
• The page 2 ranking stays the same regardless of clicks
[Diagram: page 1 and page 2 rankings]
35. Toy Example – Optimal Value
• For the optimal ranking, V_2 = 16.528
[Diagram: page 1 ranking]
36. Toy Example – Optimal Ranking
• If the car is clicked, the Jaguar logo is more relevant on the next page
[Diagram: page 1 and page 2 rankings]
37. Toy Example – Optimal Ranking
• In all other scenarios, rank the animal first on the next page
[Diagram: page 1 and page 2 rankings]
38. Interactive vs Dynamic IR
Interactive:
• Treats interactions independently
• Responds to immediate feedback
• Static IR used before feedback is received

Dynamic:
• Optimizes over the whole interaction
• Long-term gains
• Models future user feedback
• Also used at the beginning of the interaction
39. Outline
• Introduction
  • Static IR
  • Interactive IR
  • Dynamic IR
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
40. Conceptual Model – Dynamic IR
[Diagram: Static IR, Interactive IR and Dynamic IR]
• Explore and exploit feedback
41. Characteristics of Dynamic IR
• Rich interactions
  • Query formulation
  • Document clicks
  • Document examination
  • Eye movements
  • Mouse movements
  • etc.
42. Characteristics of Dynamic IR
• Temporal dependency
[Diagram: an information need I drives iterations 1 … n; at iteration i, query q_i produces ranked documents D_i and the user returns clicked documents C_i]
43. Characteristics of Dynamic IR
• Overall goal
  • Optimize over all iterations for the goal
  • IR metric or user satisfaction
  • Optimal policy
44. Dynamic IR
• Dynamic IR explores actions
• Dynamic IR learns from the user and adjusts its actions
• May hurt performance in a single stage, but improves over all stages
45. Applications to IR
• Dynamics are found in many different aspects of IR
• Dynamic users
  • Users change behaviour over time; user history
• Dynamic documents
  • Information filtering; document content change
• Dynamic queries
  • Changing query meaning, e.g. "Twitter"
• Dynamic information needs
  • Topic ontologies evolve over time
• Dynamic relevance
  • Seasonal / time-of-day changes in relevance
46. User Interactivity in DIR
• Modern IR interfaces
  • Facets
  • Verticals
• Personalization
  • Responsive to a particular user
  • Complex log data
• Mobile
  • Richer user interactions
• Ads
  • Adaptive targeting
47. Big Data
• Dataset sizes are always increasing
• Computational footprint of learning to rank
• Rich, sequential data
Example: complex user behaviour found in log data, taking into account reading, skipping and re-reading behaviours¹; uses a POMDP
¹Yin He et al., '11
48. Online Learning to Rank
• Learning to rank iteratively on sequential data
• Clicks as implicit user feedback/preference
• Often uses multi-armed bandit techniques
Examples: click models interpret clicks and a contextual bandit improves learning¹; pairwise comparison of rankings using a duelling bandits formulation²
¹Katja Hofmann et al., '11
²Yisong Yue et al., '09
49. Evaluation
• Use complex user interaction data to assess rankings
• Compare ranking techniques in online testing
• Minimise user dissatisfaction
Examples: modelled cursor activity and correlated it with eye tracking to validate good or bad abandonment¹; interleave search results from two ranking algorithms to determine which is better²
¹Jeff Huang et al., '11
²Olivier Chapelle et al., '12
50. Filtering and News
• Adaptive techniques to personalize information filtering or news recommendation
• Understand the complex dynamics of real-world events in search logs
• Capture temporal document change¹
Examples: relevance feedback adapts threshold sensitivity over time in information filtering to maximise overall utility²; detecting patterns and memes in news cycles and modelling how information spreads³
¹Dennis Fetterly et al., '03
²Stephen Robertson, '02
³Jure Leskovec et al., '09
51. Advertising
• Behavioural targeting and personalized ads
• Learn when to display new ads
• Maximise profit from the available ads
Examples: a POMDP with ad correlation finds the optimal ad to display to a user¹; a dynamic click model interprets complex user behaviour in logs and applies the results to tail queries and unseen ads²
¹Shuai Yuan et al., '12
²Zeyuan Allen Zhu et al., '10
52. Outline
• Introduction
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
53. Outline
• Introduction
• Theory and Models
  • Why not use supervised learning
  • Markov Models
• Session Search
• Reranking
• Evaluation
54. Why Not Use Supervised Learning for Dynamic IR Modeling?
• Lack of enough training data
  • Dynamic IR problems contain a sequence of dynamic interactions, e.g. a series of queries in a session
  • Repeated sequences are rare (close to zero), even in large query logs (WSCD 2013 & 2014, query logs from Yandex)
• The chance of finding repeated adjacent query pairs is also low:

Dataset   | Repeated Adjacent Query Pairs | Total Adjacent Query Pairs | Repeated Percentage
WSCD 2013 | 476,390                       | 17,784,583                 | 2.68%
WSCD 2014 | 1,959,440                     | 35,376,008                 | 5.54%
55. Our Solution
No supervised learning. Instead, try to find an optimal solution through a sequence of dynamic interactions.
Trial and error: learn from repeated, varied attempts which are continued until success.
57. Recap – Characteristics of Dynamic IR
• Rich interactions: query formulation, document clicks, document examination, eye movements, mouse movements, etc.
• Temporal dependency
• Overall goal
58. What Is a Desirable Model for Dynamic IR?
• Models interactions, which means it needs placeholders for actions
• Models the information need hidden behind user queries and other interactions
• Sets up a reward mechanism to guide the entire search algorithm in adjusting its retrieval strategies
• Represents Markov properties to handle the temporal dependency
A model for a trial-and-error setting will do. A Markov model will do!
59. Outline
• Introduction
• Theory and Models
  • Why not use supervised learning
  • Markov Models
• Session Search
• Reranking
• Evaluation
60. Markov Process
• Markov property¹ (the "memoryless" property): a system's next state depends only on its current state.
  Pr(S_{i+1} | S_i, …, S_0) = Pr(S_{i+1} | S_i)
• Markov process: a stochastic process with the Markov property, e.g.
  s_0 → s_1 → … → s_i → s_{i+1} → …
¹A. A. Markov, '06
61. Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit
62. Markov Chain (S, M)
• A discrete-time Markov process
• State S – a web page
• Transition probability M
• Example: Google PageRank¹ (sketched in Python below)
  PageRank(S) = (1 − α) / N + α Σ_{B ∈ Γ(S)} PageRank(B) / L(B)
  where N is the number of pages, Γ(S) is the set of pages linking to S, L(B) is the number of outlinks of B, and (1 − α) / N is the random-jump factor
• PageRank: how likely a random web surfer is to land on a page
• The stable-state distribution of such a Markov chain is PageRank
¹L. Page et al., '99
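A small power-iteration sketch of PageRank as the stationary distribution of this Markov chain; the link graph, damping factor and iteration count below are illustrative assumptions.

def pagerank(links, alpha=0.85, iters=50):
    """Power iteration for PageRank: the stationary distribution of a
    Markov chain with random jumps (total probability 1 - alpha)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_pr = {p: (1 - alpha) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:            # dangling page: spread mass uniformly
                for q in pages:
                    new_pr[q] += alpha * pr[p] / n
            else:
                for q in outlinks:      # L(p) = number of outlinks of p
                    new_pr[q] += alpha * pr[p] / len(outlinks)
        pr = new_pr
    return pr

# Toy web graph (illustrative): A and B link to each other, both link to C.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(links))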
63. Hidden Markov Model
• A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states¹
[Diagram: hidden states s0 → s1 → s2 → … with transition probabilities p_i; each state emits an observation o_i with emission probability e_i]
• s_i – hidden state; p_i – transition probability; o_i – observation; e_i – observation (emission) probability
(S, M, O, e)
¹Leonard E. Baum et al., '66
64. An HMM example for IR
Construct an HMM for each document¹ (scored in the Python sketch below):
[Diagram: states s0, s1, s2, … emitting query terms t0, t1, t2, …]
• s_i – "Document" or "General English"
• p_i – a0 or a1
• t_i – query term
• e_i – P(t|D) or P(t|GE)
• Document-to-query relevance:
  P(D|q) ∝ ∏_{t ∈ q} ( a0 P(t|GE) + a1 P(t|D) )
¹Miller et al., '99
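A minimal sketch of this two-state mixture score in Python; the counts, the mixture weight and the default count for unseen corpus terms are illustrative assumptions.

import math

def hmm_score(query, doc_tf, doc_len, corpus_tf, corpus_len, a1=0.3):
    """Two-state HMM document score (Miller et al. '99): each query term is
    emitted either by the 'Document' state or by 'General English'.
    log P(q|D) = sum_t log( a0 * P(t|GE) + a1 * P(t|D) ), with a0 = 1 - a1."""
    a0 = 1.0 - a1
    score = 0.0
    for t in query:
        p_t_doc = doc_tf.get(t, 0) / doc_len
        p_t_ge = corpus_tf.get(t, 1) / corpus_len   # default 1 avoids log(0)
        score += math.log(a0 * p_t_ge + a1 * p_t_doc)
    return score

# Toy counts (illustrative).
print(hmm_score(["jaguar", "car"], {"jaguar": 3, "car": 2}, 100,
                {"jaguar": 50, "car": 400}, 1_000_000))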
65. Markov Decision Process (S, M, A, R, γ)
• An MDP extends a Markov chain with actions and rewards¹
[Diagram: s0 →(a0, r0)→ s1 →(a1, r1)→ s2 →(a2, r2)→ s3 → …]
• s_i – state; a_i – action; r_i – reward; p_i – transition probability
¹R. Bellman, '57
66. Definition of MDP
• A tuple (S, M, A, R, γ)
  • S: state space
  • M: transition matrix, M_a(s, s') = P(s'|s, a)
  • A: action space
  • R: reward function, R(s, a) = immediate reward for taking action a at state s
  • γ: discount factor, 0 < γ ≤ 1
• Policy π: π(s) = the action taken at state s
• The goal is to find an optimal policy π* maximizing the expected total reward.
67. Policy
• Policy: π(s) = a, according to which an action a is selected at state s
• e.g. π(s0) = move right and up; π(s1) = move right and up; π(s2) = move right
[Slide adapted from Carlos Guestrin's ML lecture]
68–70. Value of Policy
• Value V^π(s): the expected long-term reward starting from s
  V^π(s0) = E[ R(s0) + γ R(s1) + γ² R(s2) + γ³ R(s3) + γ⁴ R(s4) + … ]
• Future rewards are discounted by γ ∈ [0, 1)
[Diagram, built up over three slides: starting from s0 with reward R(s0), the policy π(s0) leads to possible next states s1, s1', s1'' with rewards R(·); from each, π leads on to s2, s2', s2'', and so on]
[Slides adapted from Carlos Guestrin's ML lecture]
71. Computing the value of a policy
V^π(s0) = E_π[ R(s0, a) + γ R(s1, a) + γ² R(s2, a) + γ³ R(s3, a) + … ]
        = E_π[ R(s0, a) + γ Σ_{t=1}^{∞} γ^{t−1} R(s_t, a) ]
        = R(s0, a) + γ E_π[ Σ_{t=1}^{∞} γ^{t−1} R(s_t, a) ]
        = R(s0, a) + γ Σ_{s'} P(s'|s0, a) V^π(s')
where s0 is the current state, s' a possible next state, and V^π the value function.
72. Optimality – Bellman Equation
• The Bellman equation¹ for an MDP is a recursive definition of the optimal state-value function V*(·):
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
• Optimal policy:
  π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
¹R. Bellman, '57
73. Optimality – Bellman Equation
• The Bellman equation can be rewritten with the action-value function Q:
  V*(s) = max_a Q(s, a)
  Q(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s')
• Optimal policy:
  π*(s) = argmax_a Q(s, a)
• This is the relationship between V and Q.
74. MDP algorithms
• Model-based approaches (solve the Bellman equation):
  • Value Iteration
  • Policy Iteration
  • Modified Policy Iteration
  • Prioritized Sweeping
• Model-free approaches:
  • Temporal Difference (TD) Learning
  • Q-Learning
• Both routes yield the optimal value V*(s) and the optimal policy π*(s)
[Bellman '57; Howard '60; Puterman and Shin '78; Singh & Sutton '96; Sutton & Barto '98; Richard Sutton '88; Watkins '92]
[Slide adapted from Carlos Guestrin's ML lecture]
75. Value Iteration
• Initialization: initialize V_0(s) arbitrarily
• Loop (iteration):
  V_{i+1}(s) ← max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_i(s') ]
  π(s) ← argmax_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_i(s') ]
• Stopping criterion: π(s) is good enough
• (A Python sketch follows.)
¹Bellman, '57
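A compact tabular value-iteration sketch in Python; the two-state MDP, its transition table P and reward table R are illustrative assumptions, not from the tutorial.

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Value iteration: repeatedly apply the Bellman backup
    V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # Extract the greedy policy from the converged values.
    pi = {s: max(actions, key=lambda a: R[s][a] +
                 gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
          for s in states}
    return V, pi

# Toy 2-state MDP (illustrative transition and reward tables).
states, actions = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.8, "s1": 0.2}}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
print(value_iteration(states, actions, P, R))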
81. Policy Iteration
Algorithm (see the Python sketch below):
1. For each state s ∈ S: V(s) ← 0, π_0(s) ← arbitrary policy, i ← 0
2. Repeat
   2.1 (Policy evaluation) Repeat
         For each s ∈ S:
           V'(s) ← V(s)
           V(s) ← R(s, π_i(s)) + γ Σ_{s'} P(s'|s, π_i(s)) V(s')
       until max_s |V(s) − V'(s)| < ε
   2.2 (Policy improvement) For each s ∈ S:
         π_{i+1}(s) ← argmax_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]
   2.3 i ← i + 1
   Until π_i = π_{i−1}
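A matching policy-iteration sketch, reusing the P/R table format of the value-iteration example above; again an illustrative toy, not the tutorial's own code.

def policy_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Policy iteration: alternate policy evaluation (iterate V under the
    current policy) and greedy policy improvement, until the policy is stable."""
    pi = {s: actions[0] for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: V(s) <- R(s, pi(s)) + gamma * E[V(s')]
        while True:
            delta = 0.0
            for s in states:
                v = R[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * V[s2]
                                              for s2 in states)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] +
                       gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return V, pi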
82. Modified Policy Iteration
• The "policy evaluation" step in Policy Iteration is time-consuming, especially when the state space is large.
• Modified Policy Iteration computes an approximate policy evaluation by running just a few (k) evaluation iterations.
• The spectrum: k = 1 gives greedy Value Iteration, k = ∞ gives full Policy Iteration; Modified Policy Iteration sits in between.
83. Modified Policy Iteration
Algorithm:
1. For each state s ∈ S: V(s) ← 0, π_0(s) ← arbitrary policy, i ← 0
2. Repeat
   2.1 Repeat k times
         For each s ∈ S:
           V(s) ← R(s, π_i(s)) + γ Σ_{s'} P(s'|s, π_i(s)) V(s')
   2.2 For each s ∈ S:
         π_{i+1}(s) ← argmax_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') ]
   2.3 i ← i + 1
   Until π_i = π_{i−1}
84. MDP algorithms
• Model-based approaches (solve the Bellman equation):
  • Value Iteration
  • Policy Iteration
  • Modified Policy Iteration
  • Prioritized Sweeping
• Model-free approaches:
  • Temporal Difference (TD) Learning
  • Q-Learning
• Both routes yield the optimal value V*(s) and the optimal policy π*(s)
[Bellman '57; Howard '60; Puterman and Shin '78; Singh & Sutton '96; Sutton & Barto '98; Richard Sutton '88; Watkins '92]
[Slide adapted from Carlos Guestrin's ML lecture]
85. Temporal Difference Learning
• Monte Carlo sampling can be used for model-free policy evaluation: estimate V^π(s) by the average reward of trajectories starting from s
• However, parts of those trajectories can be reused, so we estimate via an expectation over the next state:
  V^π(s) ← E[ r + γ V^π(s') | s, a ]
• The simplest estimate: V^π(s) ← r + γ V^π(s')
• A smoothed version: V^π(s) ← α ( r + γ V^π(s') ) + (1 − α) V^π(s)
• TD-learning rule: V^π(s) ← V^π(s) + α ( r + γ V^π(s') − V^π(s) )
  where r is the immediate reward, α is the learning rate, and r + γ V^π(s') − V^π(s) is the temporal difference
[Richard Sutton '88; Singh & Sutton '96; Sutton & Barto '98]
86. Temporal Difference Learning
Algorithm (a Python sketch follows):
1. For each state s ∈ S: initialize V^π(s) arbitrarily
2. For each episode (state sequence):
   2.1 Initialize s
   2.2 Repeat
       2.2.1 Take action a at state s according to π
       2.2.2 Observe the immediate reward r and the next state s'
       2.2.3 V^π(s) ← V^π(s) + α ( r + γ V^π(s') − V^π(s) )
       2.2.4 s ← s'
       until s is a terminal state
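A tabular TD(0) sketch in Python; the env_step interface, the fixed policy and the toy chain environment are illustrative assumptions.

import random

def td0(episodes, policy, env_step, states, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation: after each observed transition
    (s, r, s'), nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = random.choice(states)              # start state (illustrative)
        while s is not None:                   # None marks a terminal state
            a = policy(s)
            r, s_next = env_step(s, a)         # sample from the environment
            target = r + gamma * (V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])    # TD error = target - V(s)
            s = s_next
    return V

# Toy chain environment (illustrative): s0 -> s1 -> terminal, reward 1 at the end.
def step(s, a):
    return (0.0, "s1") if s == "s0" else (1.0, None)

print(td0(500, lambda s: "forward", step, ["s0", "s1"]))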
88. Q-Learning
Algorithm (a Python sketch follows):
1. For each s ∈ S and a ∈ A: initialize Q_0(s, a) arbitrarily
2. i ← 0
3. For each episode (state sequence):
   3.1 Initialize s
   3.2 Repeat
       3.2.1 i ← i + 1
       3.2.2 Select an action a at state s according to Q_{i−1}
       3.2.3 Take action a; observe the immediate reward r and the next state s'
       3.2.4 Q_i(s, a) ← Q_{i−1}(s, a) + α ( r + γ max_{a'} Q_{i−1}(s', a') − Q_{i−1}(s, a) )
       3.2.5 s ← s'
       until s is a terminal state
4. For each s ∈ S: π(s) ← argmax_a Q_i(s, a)
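A tabular Q-learning sketch with an epsilon-greedy behaviour policy; the env_step interface and the toy chain are illustrative assumptions.

import random
from collections import defaultdict

def q_learning(env_step, start, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning.
    Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        while s is not None:                       # None marks a terminal state
            if random.random() < epsilon:          # explore
                a = random.choice(actions)
            else:                                  # exploit current estimates
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next = env_step(s, a)
            best_next = max(Q[(s_next, x)] for x in actions) if s_next else 0.0
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# Toy chain (illustrative): "go" advances and eventually reaches the goal.
def step(s, a):
    if s == "s0":
        return (0.0, "s1" if a == "go" else "s0")
    return (1.0, None) if a == "go" else (0.0, "s0")

Q = q_learning(step, "s0", ["go", "stay"])
print(max(["go", "stay"], key=lambda a: Q[("s0", a)]))  # learned action at s0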
89. Apply an MDP to an IR Problem
• We can model IR systems using a Markov Decision Process
• Is there a temporal component?
• States – what changes with each time step?
• Actions – how does your system change the state?
• Rewards – how do you measure feedback or effectiveness in your problem at each time step?
• Transition probability – can you determine this? If not, a model-free approach is more suitable
90. Apply an MDP to an IR Problem – Example
• User agent in session search
• States – the user's relevance judgement
• Actions – new queries
• Reward – information gained
91. Apply an MDP to an IR Problem – Example
• The search engine's perspective
• What if we can't directly observe the user's relevance judgement?
• Click ≠ relevance
92. Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit
93. POMDP Model
[Diagram: hidden states s0 →(a0, r0)→ s1 →(a1, r1)→ s2 →(a2, r2)→ s3 → …, with observations o1, o2, o3 emitted along the way]
• Hidden states
• Observations
• Belief
¹R. D. Smallwood et al., '73
94. POMDP Definition
• A tuple (S, M, A, R, γ, O, Θ, B)
  • S: state space
  • M: transition matrix
  • A: action space
  • R: reward function
  • γ: discount factor, 0 < γ ≤ 1
  • O: observation set; an observation is a symbol emitted according to a hidden state
  • Θ: observation function; Θ(s, a, o) is the probability that o is observed when the system transitions into state s after taking action a, i.e. P(o|s, a)
  • B: belief space; a belief is a probability distribution over the hidden states
95. POMDP – Belief Update
• The agent uses a state estimator to update its belief about the hidden states (a Python sketch follows):
  b' = SE(b, a, o')
  b'(s') = P(s' | o', a, b)
         = P(s', o' | a, b) / P(o' | a, b)
         = Θ(s', a, o') Σ_s M(s, a, s') b(s) / P(o' | a, b)
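A direct transcription of this belief update into Python; the two-state click example and all probabilities are illustrative assumptions.

def belief_update(b, a, o, M, Z, states):
    """POMDP state estimator SE(b, a, o): Bayes' rule over hidden states.
    b'(s') ∝ Z(s',a,o) * sum_s M(s,a,s') * b(s); the normalizer is P(o|a,b)."""
    b_new = {}
    for s2 in states:
        b_new[s2] = Z[(s2, a, o)] * sum(M[(s, a, s2)] * b[s] for s in states)
    norm = sum(b_new.values())              # this is P(o | a, b)
    return {s: p / norm for s, p in b_new.items()}

# Toy 2-state example (illustrative): clicks are more likely when the hidden
# "relevant" state holds, so observing a click shifts belief toward it.
states = ["relevant", "nonrelevant"]
M = {(s, "rank", s2): 1.0 if s == s2 else 0.0 for s in states for s2 in states}
Z = {("relevant", "rank", "click"): 0.7, ("relevant", "rank", "skip"): 0.3,
     ("nonrelevant", "rank", "click"): 0.2, ("nonrelevant", "rank", "skip"): 0.8}
b = {"relevant": 0.5, "nonrelevant": 0.5}
print(belief_update(b, "rank", "click", M, Z, states))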
96. POMDP – Bellman Equation
• The Bellman equation for a POMDP:
  V(b) = max_a [ r(b, a) + γ Σ_{b'} P(b'|b, a) V(b') ]
• A POMDP can be transformed into a continuous belief MDP (B, M', A, r, γ)
  • B: the continuous belief space
  • M': transition function, M'_a(b, b') = Σ_{o ∈ O} 1_{b,a}(b', o) P(o|b, a),
    where 1_{b,a}(b', o) = 1 if SE(b, a, o) = b', and 0 otherwise
  • A: action space
  • r: reward function, r(b, a) = Σ_{s ∈ S} b(s) R(s, a)
97. Solving POMDPs – The Witness Algorithm¹
• The optimal policy of a POMDP = the optimal policy of its belief MDP
• A variation of the value iteration algorithm
¹L. Kaelbling et al., '98
98. Policy Tree
• A policy tree of depth i is an i-step non-stationary policy
• As if we ran value iteration until the i-th iteration
[Diagram: a root action a(h) with i steps to go; each observation o_1 … o_l branches to a subtree with i−1 steps to go, down through 2 steps to go and 1 step to go]
99. Value of a Policy Tree
• We can only determine the value of a policy tree h from some belief state b, because the agent never knows the exact state:
  V_h(b) = Σ_{s ∈ S} b(s) V_h(s)
  V_h(s) = R(s, a_h) + γ Σ_{s' ∈ S} M_{a_h}(s, s') Σ_{o_k ∈ O} Θ(s', a_h, o_k) V_{o_k(h)}(s')
  where a_h is the action at the root node of h, and o_k(h) is the (i−1)-step subtree associated with o_k under the root node of h
100. Idea of the Witness Algorithm
• For each action a, compute Γ_i^a, the set of candidate i-step policy trees with action a at their roots
• The optimal value function at the i-th step, V_i*(b), is the upper surface of the value functions of all i-step policy trees
101. Optimal value function
• Geometrically, V_i*(b) is piecewise linear and convex:
  V_i*(b) = max_{h ∈ H} V_h(b)
[Diagram: for a two-state POMDP the belief space is one-dimensional (simplex constraint b(s1) + b(s2) = 1); the value functions V_h1(b) … V_h5(b) are lines and V_i*(b) is their upper surface]
• This motivates pruning the set of policy trees
102. Outline of the Witness Algorithm
Algorithm:
1. H_1 ← {}
2. i ← 1
3. Repeat
   3.1 i ← i + 1
   3.2 For each a in A: Γ_i^a ← witness(H_{i−1}, a)   (the inner loop)
   3.3 Prune ∪_a Γ_i^a to get H_i
   until sup_b |V_i(b) − V_{i−1}(b)| < ε
103. Inner Loop of the Witness Algorithm
1. Select a belief b arbitrarily. Generate a best i-step policy tree h_i for it. Add h_i to an agenda.
2. In each iteration:
   2.1 Select a policy tree h_new from the agenda.
   2.2 Look for a witness point b using Z_a and h_new.
   2.3 If such a witness point b is found:
       2.3.1 Calculate the best policy tree h_best for b.
       2.3.2 Add h_best to Z_a.
       2.3.3 Add all the alternative trees of h_best to the agenda.
   2.4 Else remove h_new from the agenda.
3. Repeat until the agenda is empty.
105. Applying POMDP to Dynamic IR

POMDP                | Dynamic IR
---------------------|------------------------------------------------------------
Environment          | Documents
Agents               | User, search engine
States               | Queries, user's decision-making status, relevance of documents, etc.
Actions              | Provide a ranking of documents; weigh terms in the query; add/remove/keep query terms; switch a search technology on or off; adjust parameters of a search technology
Observations         | Queries, clicks, document lists, snippets, terms, etc.
Rewards              | Evaluation measures (such as DCG, NDCG or MAP); clicking information
Transition matrix    | Given in advance or estimated from training data
Observation function | Problem dependent; estimated from sample datasets
106. Session Search Example - States
• Four states combining relevance and exploration, starting from q0: S_RT (Relevant & Exploitation), S_RR (Relevant & Exploration), S_NRT (Non-Relevant & Exploitation), S_NRR (Non-Relevant & Exploration)
• Example query transitions: "scooter price" → "scooter stores"; "Hartford visitors" → "Hartford Connecticut tourism"; "Philadelphia NYC travel" → "Philadelphia NYC train"; "distance New York Boston" → maps.bing.com
[J. Luo et al., '14]
107. Session Search Example - Actions
(A_u, A_se)
• User actions (A_u):
  • Add query terms (+Δq)
  • Remove query terms (−Δq)
  • Keep query terms (q_theme)
  • Clicked documents
  • SAT-clicked documents
• Search engine actions (A_se):
  • Increase/decrease/keep term weights
  • Switch query expansion on or off
  • Adjust the number of top documents used in PRF
  • etc.
[J. Luo et al., '14]
108. Multi Page Search Example – States & Actions
• State: relevance of documents
• Action: ranking of documents
• Observation: clicks
• Belief: multivariate Gaussian
• Reward: DCG over 2 pages
[Xiaoran Jin et al., '13]
109. Exercise
Dynamic Information Retrieval Modeling, SIGIR Tutorial, July 7th 2014
Grace Hui Yang, Marc Sloan, Jun Wang. Guest Speaker: Emine Yilmaz
110. Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-Armed Bandit
111. Multi-Armed Bandits (MAB)
[Cartoon: a gambler facing a row of slot machines asks "Which slot machine should I select in this round?" and receives a reward]
112. Multi-Armed Bandits (MAB)
[Cartoon: "I won! Is this the best slot machine?"]
113. MAB Definition
• A tuple (S, A, R, B)
  • S: the hidden reward distribution of each bandit
  • A: choose which bandit to play
  • R: reward for playing a bandit
  • B: belief space, our estimate of each bandit's distribution
114. Comparison with Markov Models
• A single-state Markov Decision Process
  • No transition probability
• Similar to a POMDP in that we maintain a belief state
• Action = choose a bandit; it does not affect the state
• Does not "plan ahead" but adapts intelligently
• Somewhere between interactive and dynamic IR
115. Markov Multi-Armed Bandits
[Cartoon: each slot machine 1 … k is itself a Markov process; "Which slot machine should I select in this round?"]
116. Markov Multi-Armed Bandits
[Cartoon: as above, with the chosen machine's Markov process advancing on each action and yielding the reward]
117. MAB Policy Reward
• An MAB algorithm describes a policy π for choosing bandits
• Maximise rewards from the chosen bandits over all time steps
• Minimize regret:
  Regret = Σ_{t=1}^{T} [ Reward(a*) − Reward(a_π(t)) ]
• The cumulative difference between the optimal reward and the actual reward
118. Exploration vs Exploitation
• Exploration
  • Try out bandits to find which has the highest average reward
• Exploitation
  • Too much exploration leads to poor performance
  • Play bandits that are known to pay out higher rewards on average
• MAB algorithms balance exploration and exploitation
  • Start by exploring more to find the best bandits
  • Exploit more as the best bandits become known
120. MAB – Index Algorithms
• Gittins index¹
  • Play the bandit with the highest "Dynamic Allocation Index"
  • Modelled using an MDP, but suffers the "curse of dimensionality"
• ε-greedy²
  • Play the highest-reward bandit with probability 1 − ε
  • Play a random bandit with probability ε
• UCB (Upper Confidence Bound)³ (sketched in Python below)
  • Play the bandit i with the highest x̄_i + sqrt(2 ln t / n_i)
  • The chance of playing infrequently played bandits increases over time
¹J. C. Gittins, '89
²Nicolò Cesa-Bianchi et al., '98
³P. Auer et al., '02
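A minimal UCB1 sketch in Python; the pull interface and the Bernoulli arms are illustrative assumptions.

import math
import random

def ucb1(pull, n_arms, rounds=1000):
    """UCB1: play the arm maximizing mean reward + sqrt(2 ln t / n_i).
    The bonus shrinks as an arm is played, balancing exploration/exploitation."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1                      # play every arm once first
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean
    return means, counts

# Toy Bernoulli bandits (illustrative payout probabilities).
probs = [0.2, 0.5, 0.8]
means, counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                     len(probs))
print(counts)   # the 0.8 arm should dominate the plays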
121. MAB use in IR
• Choosing ads to display to users¹
  • Each ad is a bandit
  • The user click-through rate is the reward
• Recommending news articles²
  • Each news article is a bandit
  • Similar to the information filtering case
• Diversifying search results³
  • Each rank position is an MAB dependent on the higher ranks
  • Documents are the bandits chosen at each rank
¹Deepayan Chakrabarti et al., '09
²Lihong Li et al., '10
³Radlinski et al., '08
122. MAB Variations
• Contextual bandits¹
  • The world has some context x ∈ X (e.g. user location)
  • Learn a policy π: X → A mapping contexts to arms (online or offline)
• Duelling bandits²
  • Play two (or more) bandits at each time step
  • Observe relative rather than absolute reward
  • Learn an ordering of the bandits
• Mortal bandits³
  • The value of bandits decays over time
  • Exploitation > exploration
¹Lihong Li et al., '10
²Yisong Yue et al., '09
³Deepayan Chakrabarti et al., '09
123. Comparison of Markov Models
• MC – a fully observable stochastic process
• HMM – a partially observable stochastic process
• MDP – a fully observable decision process
• MAB – a decision process, either fully or partially observable
• POMDP – a partially observable decision process

Model | Actions | Rewards | States
MC    | No      | No      | Observable
HMM   | No      | No      | Unobservable
MDP   | Yes     | Yes     | Observable
POMDP | Yes     | Yes     | Unobservable
MAB   | Yes     | Yes     | Fixed
124. Exercise
Dynamic Information Retrieval Modeling, SIGIR Tutorial, July 7th 2014
Grace Hui Yang, Marc Sloan, Jun Wang. Guest Speaker: Emine Yilmaz
125. Outline
• Introduction
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
126. TREC Session Tracks (2010-2012)
• Given a series of queries {q1, q2, …, qn}, the top-10 retrieval results {D1, …, D_{i−1}} for q1 to q_{i−1}, and click information
• The task is to retrieve a list of documents for the current/last query, qn
  • Relevance judgments are based on how relevant the documents are for qn, and how relevant they are for the information need of the entire session (in the topic description)
• No need to segment the sessions
127. TREC 2012 Session 6
1. pocono mountains pennsylvania
2. pocono mountains pennsylvania hotels
3. pocono mountains pennsylvania things to do
4. pocono mountains pennsylvania hotels
5. pocono mountains camelbeach
6. pocono mountains camelbeach hotel
7. pocono mountains chateau resort
8. pocono mountains chateau resort attractions
9. pocono mountains chateau resort getting to
10. chateau resort getting to
11. pocono mountains chateau resort directions
Information need: You are planning a winter vacation to the Pocono Mountains region in Pennsylvania in the US. Where will you stay? What will you do while there? How will you get there?
In a session, queries change constantly.
128. Query Change Is an Important Form of Feedback
• We define query change as the syntactic editing change between two adjacent queries:
  Δq_i = q_i − q_{i−1}
• It includes:
  • +Δq_i, the added terms
  • −Δq_i, the removed terms
• The unchanged/shared terms are called theme terms, q_theme
Example:
  q1 = "bollywood legislation", q2 = "bollywood law"
  Theme term = "bollywood"; added (+Δq) = "law"; removed (−Δq) = "legislation"
129. Where Do These Query Changes Come From?
• Given the TREC Session settings, we consider two sources of query change:
  • the previous search results that a user viewed/read/examined
  • the information need
• Example: Kurosawa → Kurosawa wife
  • "wife" is not in any previous result, but is in the topic description
• However, knowing the information need before the search is difficult to achieve
130. Previous Search Results Can Influence Query Change in Quite Complex Ways
• Merck lobbyists → Merck lobbying US policy
• D1 contains several mentions of "policy", such as: "A lobbyist who until 2004 worked as senior policy advisor to Canadian Prime Minister Stephen Harper was hired last month by Merck …"
• These mentions are about Canadian policies, while the user adds US policy in q2
• Our guess is that the user might be inspired by "policy", but prefers a sub-concept different from "Canadian policy"
• Therefore, among the added terms "US policy", "US" is the novel term, and "policy" is not, since it appeared in D1
• The two terms should be treated differently
131. Applying MDP to Session Search
• We propose to model session search as a Markov decision process (MDP)
• Two agents: the user and the search engine
• Environment: search results
• States: queries
• Actions:
  • User actions: add/remove/keep query terms
  • Search engine actions: increase/decrease/keep term weights
132. Search Engine Agent's Actions

Term type | In D_{i−1}? | Action    | Example
q_theme   | Y           | increase  | "pocono mountain" in s6
q_theme   | N           | increase  | "france world cup 98 reaction" in s28: france world cup 98 reaction stock market → france world cup 98 reaction
+Δq       | Y           | decrease  | "policy" in s37: Merck lobbyists → Merck lobbyists US policy
+Δq       | N           | increase  | "US" in s37: Merck lobbyists → Merck lobbyists US policy
−Δq       | Y           | decrease  | "reaction" in s28: france world cup 98 reaction → france world cup 98
−Δq       | N           | no change | "legislation" in s32: bollywood legislation → bollywood law
133. Query Change Retrieval Model (QCM)
• The Bellman equation gives the optimal value for an MDP:
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
• The reward function is used as the document relevance scoring function, derived backwards from the Bellman equation:
  Score(q_i, d) = P(q_i|d) + γ Σ_a P(q_i | q_{i−1}, D_{i−1}, a) max_{D_{i−1}} P(q_{i−1}|D_{i−1})
  where P(q_i|d) is the current reward (relevance score), P(q_i | q_{i−1}, D_{i−1}, a) is the query transition model, and max_{D_{i−1}} P(q_{i−1}|D_{i−1}) is the maximum past relevance
134. Calculating the Transition Model
• Expanding the transition model according to the query change and the search engine actions gives the QCM scoring function:
  Score(q_i, d) = log P(q_i|d)                                           (current reward / relevance score)
    + α Σ_{t ∈ q_theme} [1 − P(t|d*_{i−1})] log P(t|d)                    (increase weights for theme terms)
    + ε Σ_{t ∈ +Δq, t ∉ d*_{i−1}} idf(t) log P(t|d)                       (increase weights for novel added terms)
    − β Σ_{t ∈ +Δq, t ∈ d*_{i−1}} P(t|d*_{i−1}) idf(t) log P(t|d)         (decrease weights for old added terms)
    − δ Σ_{t ∈ −Δq} P(t|d*_{i−1}) log P(t|d)                              (decrease weights for removed terms)
135. Maximizing the Reward Function
• Generate a maximum-rewarded document, denoted d*_{i−1}, from D_{i−1}
  • That is, the document(s) most relevant to q_{i−1}
• The relevance score can be calculated as:
  P(q_{i−1}|d_{i−1}) = 1 − ∏_{t ∈ q_{i−1}} { 1 − P(t|d_{i−1}) }
  P(t|d_{i−1}) = #(t, d_{i−1}) / |d_{i−1}|
• From several options, we choose to use only the top-relevance document:
  max_{D_{i−1}} P(q_{i−1}|D_{i−1})
136. Scoring the Entire Session
• The overall relevance score for a session of queries is aggregated recursively (a Python sketch follows):
  Score_session(q_n, d) = Score(q_n, d) + γ Score_session(q_{n−1}, d)
                        = Score(q_n, d) + γ [ Score(q_{n−1}, d) + γ Score_session(q_{n−2}, d) ]
                        = Σ_{i=1}^{n} γ^{n−i} Score(q_i, d)
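A tiny sketch of this recursive aggregation in its closed form; the per-query scores and the discount value below are illustrative.

def session_score(query_scores, gamma=0.92):
    """Aggregated QCM session score:
    Score_session(q_n, d) = sum_i gamma^(n-i) * Score(q_i, d),
    so earlier queries in the session are discounted more heavily."""
    n = len(query_scores)
    return sum((gamma ** (n - i)) * s for i, s in enumerate(query_scores, start=1))

# Per-query scores Score(q_i, d) for one document across a 3-query session
# (illustrative values).
print(session_score([1.2, 0.8, 2.1]))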
144. Relevance Feedback
• No UI changes
• Interactivity is hidden
• Private, performed in the browser
145. Relevance Feedback
Page 1:
• Diverse ranking
• Maximise learning potential
• Exploration vs exploitation
Page 2:
• Clickthroughs or explicit ratings
• Respond to feedback from page 1
• Personalized
154. Model – Bellman Equation
• Optimize s1 to improve E[V_2]; a simplified form of the two-page objective:
  V(R1, Σ1, 1) = max_{s1} [ λ s1 R1 · γ1 + max_{s2} (1 − λ) E( s2 R2 · γ2 | s1, clicks ) ]
  where s_t is the ranking on page t, R_t the relevance, γ_t the rank discount, and λ the page-1/page-2 trade-off
155. λ
• Balances exploration and exploitation on page 1
• Tuned for different queries:
  • Navigational
  • Informational
• λ = 1 for non-ambiguous search
156. Approximation
• Monte Carlo sampling:
  ≈ max_{s1} [ λ s1 R1 · γ1 + max_{s2} (1 − λ) (1/N) Σ_{n=1}^{N} s2 R2^(n) · γ2 ]
• Sequential ranking decision
157. Experiment Data
• Difficult to evaluate without access to live users
• Simulated using 3 TREC collections and relevance judgements:
  • WT10G – explicit ratings
  • TREC-8 – clickthroughs
  • Robust – difficult (ambiguous) search
158. User Simulation
• Rank M documents
• Simulate user clicks according to the relevance judgements
• Update the page-2 ranking
• Measure at pages 1 and 2:
  • Recall
  • Precision
  • nDCG
  • MRR
• BM25 – prior ranking model
165. Results
• Similar results across datasets and metrics
• 2nd-page gains outweigh 1st-page losses
• Outperformed Maximum Marginal Relevance, using MRR to measure diversity
• BM25-U is simply the no-exploration case
• Similar results when M = 5
174. Different Approaches to Evaluation
• Online evaluation
  • Design interactive experiments
  • Use users' actions to evaluate quality
  • Inherently dynamic in nature
• Offline evaluation
  • Controlled laboratory experiments
  • The user's interaction with the engine is only simulated
  • Recent work has focused on dynamic IR evaluation
175. Online Evaluation
• Standard click metrics
  • Clickthrough rate
  • Probability a user skips over results they have considered (pSkip)
• Most recently: result interleaving
[Diagram: two rankings are interleaved; user clicks/non-clicks evaluate the rankers]
176. What is result interleaving?
• A way to compare rankers online
• Given the two rankings produced by two methods, present a combination of the rankings to users
• Team Draft Interleaving (Radlinski et al., 2008): interleaving two rankings (sketched in Python below)
  • Input: two rankings ("can be seen as teams who pick players")
  • Repeat:
    • Toss a coin to see which team (ranking) picks next
    • The winner picks its best remaining player (document)
    • The loser picks its best remaining player (document)
  • Output: one ranking (2 teams of 5)
• Credit assignment: the ranking providing more of the clicked results wins
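A sketch of Team Draft Interleaving plus click-based credit assignment in Python; the function names and toy rankings are illustrative assumptions, not the authors' code.

import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Team Draft Interleaving (Radlinski et al., 2008): teams A and B take
    turns picking their best remaining document; a coin toss decides who
    picks first in each round. Returns the combined ranking and team labels."""
    interleaved, teams, used = [], [], set()
    while len(interleaved) < k and (set(ranking_a) | set(ranking_b)) - used:
        first = random.choice(["A", "B"])
        for team in (first, "B" if first == "A" else "A"):
            pool = ranking_a if team == "A" else ranking_b
            pick = next((d for d in pool if d not in used), None)
            if pick is not None:
                interleaved.append(pick)
                teams.append(team)
                used.add(pick)
    return interleaved, teams

def credit(teams, clicked_ranks):
    """The ranker contributing more of the clicked results wins."""
    a = sum(1 for r in clicked_ranks if teams[r] == "A")
    b = sum(1 for r in clicked_ranks if teams[r] == "B")
    return "A" if a > b else "B" if b > a else "tie"

ranking, teams = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d5"])
print(ranking, credit(teams, [0]))   # the user clicked the top result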
177. Team Draft Interleaving – Example
Ranking A:
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
Ranking B:
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)
Presented ranking (interleaving A and B):
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)
178–179. Team Draft Interleaving – Example
[Slides 178 and 179 repeat the rankings of slide 177; the user's clicks fall on results contributed by Ranking B, so B wins. Repeat over many different queries!]
180. Offline Evaluation
• Controlled laboratory experiments
• The user's interaction with the engine is only simulated
  • Ask experts to judge each query result
  • Predict how users behave when they search
  • Aggregate the judgments to evaluate
181. Offline Evaluation
• Until recently, metrics assumed that the user's information need was not affected by the documents read
  • e.g. Average Precision, NDCG, …
• Users are more likely to stop searching when they see a highly relevant document
• Lately: metrics that incorporate the effect of the relevance of documents seen by the user on user behavior
  • Based on devising more realistic user models
  • EBU, ERR [Yilmaz et al. CIKM '10; Chapelle et al. CIKM '09]
182. Modeling User Behavior – Cascade-Based Models
[Screenshot: results for the query "black powder ammunition", ranks 1–10]
• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied
• The probability of satisfaction is proportional to the relevance grade of the document at rank i
• Once the user is satisfied with a document, they terminate the search
185. Expected Reciprocal Rank [Chapelle et al. CIKM '09]
[Flowchart: for the query "black powder ammunition", the user views an item and asks "Relevant?"; if highly relevant, stop; if somewhat or not relevant, view the next item]
186. Expected Reciprocal Rank [Chapelle et al. CIKM '09]
• φ(r): the utility of finding "the perfect" document at rank r, with φ(r) = 1/r:
  ERR = Σ_{r=1}^{n} (1/r) P(user stops at position r)
      = Σ_{r=1}^{n} (1/r) R_r ∏_{i=1}^{r−1} (1 − R_i)
• R_i models both the probability of relevance of document i and the probability of stopping at document i:
  R_i = (2^{g_i} − 1) / 2^{g_max}
  where g_i is the relevance grade of the i-th document
189. Measuring "Goodness"
• The user steps down a ranked list of documents and observes each one of them until a decision point, and either
  a) abandons the search, or
  b) reformulates
• While stepping down or sideways, the user accumulates utility
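Before moving to session-level measures, a direct computation of the ERR formula from slide 186 under the cascade model; the grade scale and example grades are illustrative.

def err(grades, g_max=4):
    """Expected Reciprocal Rank (Chapelle et al. '09).
    R_i = (2^g_i - 1) / 2^g_max is the stop probability at document i;
    ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i)."""
    p_continue = 1.0
    score = 0.0
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / (2 ** g_max)
        score += p_continue * stop / r
        p_continue *= 1 - stop
    return score

# Relevance grades of the top-4 results (illustrative, 0..4 scale).
print(err([4, 0, 2, 1]))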
190. Evaluation over a single ranked list
[Diagram: a session of reformulated queries, "kenya cooking traditional swahili", "kenya cooking traditional", "kenya swahili traditional food recipes", each with its own ranked list of results]
192. Session DCG [Järvelin et al. ECIR 2008]
• Each ranked list in the session (e.g. "kenya cooking traditional swahili", "kenya cooking traditional") contributes a DCG, discounted by its position in the reformulation sequence (sketched in Python below):
  DCG(RL) = Σ_{r=1}^{k} (2^{rel(r)} − 1) / log_b(r + b − 1)
  sDCG = (1 / log_c(1 + c − 1)) DCG(RL1) + (1 / log_c(2 + c − 1)) DCG(RL2) + …
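A small sDCG sketch with the commonly used bases b = 2 and c = 4 (the discount bases are illustrative choices); a session is a list of per-query relevance-grade lists.

import math

def dcg(rels, b=2):
    """DCG(RL) = sum_r (2^rel(r) - 1) / log_b(r + b - 1)."""
    return sum((2 ** rel - 1) / math.log(r + b - 1, b)
               for r, rel in enumerate(rels, start=1))

def sdcg(session, b=2, c=4):
    """Session DCG (Järvelin et al. ECIR 2008): the DCG of the q-th ranked
    list in a session is discounted by 1 / log_c(q + c - 1)."""
    return sum(dcg(rels, b) / math.log(q + c - 1, c)
               for q, rels in enumerate(session, start=1))

# Two reformulations: relevance grades of the top-3 results of each list
# (illustrative values).
print(sdcg([[2, 1, 0], [3, 0, 1]]))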
194. Probability of a path
[Diagram: a session with queries Q1, Q2, Q3 and per-rank relevance labels (N/R)]
• Probability of a path = probability of abandoning at reformulation 2 (1) × probability of reformulating at rank 3 (2)
195. Expected Global Utility [Yang and Lad ICTIR 2009]
1. The user steps down the ranked results one by one
2. Stops browsing documents based on a stochastic process that defines a stopping-probability distribution over ranks, and reformulates
3. Gains something from relevant documents, accumulating utility
196. [Diagram: a session with queries Q1, Q2, Q3 and per-rank relevance labels (N/R)]
• Probability of abandoning the session at reformulation i (1): geometric with parameter p_reform
197. [Diagram: the same session]
• Probability of reformulating at rank j (2): geometric with parameter p_down
198. Expected Global Utility [Yang and Lad ICTIR 2009]
• The probability of a user following a path σ:
  P(σ) = P(r1, r2, ..., rK), where r_i is the stopping and reformulation point in list i
• Assumption: stopping positions in each list are independent:
  P(r1, r2, ..., rK) = P(r1) P(r2) ... P(rK)
• Use a geometric distribution (as in RBP) to model the stopping and reformulation behaviour:
  P(r_i = r) = (1 − p) p^{r−1}
199. Conclusions
• Recent focus on evaluating the dynamic nature of the search process:
  • Interleaving
  • New offline evaluation metrics: ERR, EBU
  • Session evaluation metrics
200. Outline
• Introduction
• Theory and Models
• Session Search
• Reranking
• Guest Talk: Evaluation
• Conclusion
201. Conclusions
• Dynamic IR describes a new class of interactive model
  • It incorporates rich feedback and temporal dependency, and is goal oriented
• The family of Markov models and multi-armed bandit theory are useful in building DIR models
• Applicable to a range of IR problems
• Useful in applications such as session search and evaluation
202. Dynamic IR Book
• Published by Morgan & Claypool
• "Synthesis Lectures on Information Concepts, Retrieval, and Services"
• Due March/April 2015 (in time for SIGIR 2015)
203. Acknowledgment
• We thank Dr. Emine Yilmaz for giving the guest talk.
• We sincerely thank Dr. Xuchu Dong for his help in preparing the tutorial.
• We also thank the following colleagues for their comments and suggestions:
  • Dr. Jamie Callan
  • Dr. Ophir Frieder
  • Dr. Fernando Diaz
  • Dr. Filip Radlinski
206. References
Static IR
• Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison-Wesley, 1999.
• The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. 1999.
• Implicit User Modeling for Personalized Search. Xuehua Shen et al. CIKM, 2005.
• A Short Introduction to Learning to Rank. Hang Li. IEICE Transactions 94-D(10): 1854-1862, 2011.
207. References
Interactive IR
• Relevance Feedback in Information Retrieval. Rocchio, J. J. The SMART Retrieval System (pp. 313-23), 1971.
• A study in interface support mechanisms for interactive information retrieval. Ryen W. White et al. JASIST, 2006.
• Visualizing stages during an exploratory search session. Bill Kules et al. HCIR, 2011.
• Dynamic Ranked Retrieval. Cristina Brandt et al. WSDM, 2011.
• Structured Learning of Two-level Dynamic Rankings. Karthik Raman et al. CIKM, 2011.
208. References
Dynamic IR
• A hidden Markov model information retrieval system. D. R. H. Miller, T. Leek, and R. M. Schwartz. SIGIR '99, pages 214-221.
• Threshold setting and performance optimization in adaptive filtering. Stephen Robertson. JIR, 2002.
• A large-scale study of the evolution of web pages. Dennis Fetterly et al. WWW, 2003.
• Learning diverse rankings with multi-armed bandits. Filip Radlinski, Robert Kleinberg, Thorsten Joachims. ICML, 2008.
• Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem. Yisong Yue et al. ICML, 2009.
• Meme-tracking and the dynamics of the news cycle. Jure Leskovec et al. KDD, 2009.
209. References
Dynamic IR
• Mortal multi-armed bandits. Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, Eli Upfal. NIPS, 2009.
• A Novel Click Model and Its Applications to Online Advertising. Zeyuan Allen Zhu et al. WSDM, 2010.
• A contextual-bandit approach to personalized news article recommendation. Lihong Li, Wei Chu, John Langford, Robert E. Schapire. WWW, 2010.
• Inferring search behaviors using partially observable Markov model with duration (POMD). Yin He et al. WSDM, 2011.
• No Clicks, No Problem: Using Cursor Movements to Understand and Improve Search. Jeff Huang et al. CHI, 2011.
• Balancing Exploration and Exploitation in Learning to Rank Online. Katja Hofmann et al. ECIR, 2011.
• Large-Scale Validation and Analysis of Interleaved Search Evaluation. Olivier Chapelle et al. TOIS, 2012.
210. References
Dynamic IR
• Using Control Theory for Stable and Efficient Recommender Systems. T. Jambor, J. Wang, N. Lathia. WWW '12, pages 11-20.
• Sequential selection of correlated ads by POMDPs. Shuai Yuan et al. CIKM, 2012.
• Utilizing query change for session search. D. Guan, S. Zhang, and H. Yang. SIGIR '13, pages 453-462.
• Query Change as Relevance Feedback in Session Search (short paper). S. Zhang, D. Guan, and H. Yang. SIGIR, 2013.
• Interactive exploratory search for multi page search results. X. Jin, M. Sloan, and J. Wang. WWW '13.
• Interactive Collaborative Filtering. X. Zhao, W. Zhang, J. Wang. CIKM, 2013, pages 1411-1420.
• Win-win search: Dual-agent stochastic game in session search. J. Luo, S. Zhang, and H. Yang. SIGIR '14.
211. References
Markov Processes
• A Markovian decision process. R. Bellman. Indiana University Mathematics Journal, 6:679-684, 1957.
• Dynamic Programming. R. Bellman. Princeton University Press, Princeton, NJ, USA, first edition, 1957.
• Dynamic Programming and Markov Processes. R. A. Howard. MIT Press, 1960.
• Linear Programming and Sequential Decisions. Alan S. Manne. Management Science, 1960.
• Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Baum, Leonard E.; Petrie, Ted. The Annals of Mathematical Statistics 37, 1966.
212. References
Markov Processes
• Learning to predict by the methods of temporal differences. Richard Sutton. Machine Learning 3, 1988.
• Computationally feasible bounds for partially observed Markov decision processes. W. Lovejoy. Operations Research 39: 162-175, 1991.
• Q-Learning. Christopher J. C. H. Watkins, Peter Dayan. Machine Learning, 1992.
• Reinforcement learning with replacing eligibility traces. Singh, S. P. & Sutton, R. S. Machine Learning, 22, pages 123-158, 1996.
• Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. MIT Press, 1998.
• Planning and acting in partially observable stochastic domains. L. Kaelbling, M. Littman, and A. Cassandra. Artificial Intelligence, 101(1-2):99-134, 1998.
213. References
Markov Processes
• Finding approximate POMDP solutions through belief compression. N. Roy. PhD Thesis, Carnegie Mellon, 2003.
• VDCBPI: an approximate scalable algorithm for large scale POMDPs. P. Poupart and C. Boutilier. NIPS 2004, pages 1081-1088.
• Finding Approximate POMDP Solutions Through Belief Compression. N. Roy, G. Gordon and S. Thrun. Journal of Artificial Intelligence Research, 23:1-40, 2005.
• Probabilistic Robotics. S. Thrun, W. Burgard, D. Fox. MIT Press, 2005.
• Anytime Point-Based Approximations for Large POMDPs. J. Pineau, G. Gordon and S. Thrun. Volume 27, pages 335-380, 2006.
214. References
Markov Processes
• The optimal control of partially observable Markov decision processes over a finite horizon. R. D. Smallwood, E. J. Sondik. Operations Research, 1973.
• Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. M. L. Puterman and M. C. Shin. Management Science 24, 1978.
• An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains. A. A. Markov. Science in Context, 19:591-600, 2006.
• Learning to Rank for Information Retrieval. Tie-Yan Liu. Springer Science & Business Media, 2011.
• Finite-Time Regret Bounds for the Multiarmed Bandit Problem. Nicolò Cesa-Bianchi, Paul Fischer. ICML, pages 100-108, 1998.
• Multi-armed bandit allocation indices. J. C. Gittins. Wiley, 1989.
• Finite-time Analysis of the Multiarmed Bandit Problem. Peter Auer et al. Machine Learning 47, Issue 2-3, 2002.