Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share two of our recent papers in this area that we presented at SIGIR 2018.
Adversarial and reinforcement learning approaches to optimize information retrieval
1. ADVERSARIAL AND REINFORCEMENT LEARNING BASED APPROACHES TO INFORMATION RETRIEVAL
Bhaskar Mitra
Principal Applied Scientist, Microsoft AI & Research
Joint work with Daniel Cohen, Katja Hofmann, W. Bruce Croft,
Corby Rosset, Damien Jose, Gargi Ghosh, and Saurabh Tiwary
SIGIR 2018 | Ann Arbor, Michigan
2. Today’s topics: two SIGIR 2018 short papers
Awarded SIGIR 2018 Best Short Paper
https://arxiv.org/abs/1805.03403 https://arxiv.org/abs/1804.04410
3. Cross Domain Regularization for Neural Ranking Models Using Adversarial Learning
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, W. Bruce Croft
https://arxiv.org/abs/1805.03403
4. Clever Hans was a horse claimed to have been
capable of performing arithmetic and other
intellectual tasks.
"If the eighth day of the month comes on a
Tuesday, what is the date of the following Friday?“
Hans would answer by tapping his hoof.
In fact, the horse was purported to have been
responding directly to involuntary cues in the
body language of the human trainer, who had the
faculties to solve each problem. The trainer was
entirely unaware that he was providing such cues.
(source: Wikipedia)
5. Duet model for document ranking (2017)
Latent representation learning models (e.g., duet and DSSM) "memorize" relationships between terms and entities
6. [Figure: results for the query "uk prime minister" on today's web, on recent collections, and on older (1990s) TREC data, illustrating that learned term associations differ across collections]
7. Cross domain performance is an important requirement in many IR scenarios, e.g.,
1. Bing (across markets)
2. Enterprise search (across tenants)
8. What corpus statistics do they depend on?
BM25: the inverse document frequency of terms
Duet: embeddings containing noisy co-occurrence information
10. The distributed sub-model of duet
Projects query and document to latent space for matching
Additional fully-connected layers to estimate relevance
Hidden layers may encode domain specific statistics
[Architecture diagram: query and document each pass through convolution and pooling layers, are combined via a hadamard product, and dense layers estimate the relevance 𝑦]
How do we encourage the model to only learn features that generalize across multiple domains?
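As a concrete reference for the architecture just described, here is a minimal PyTorch sketch of the distributed sub-model's shape. The layer sizes, the n-graph input dimension, and the activation choices are illustrative assumptions, not the values from the duet paper:

```python
import torch
import torch.nn as nn

class DuetDistributedSketch(nn.Module):
    """Illustrative shape of the duet distributed sub-model:
    convolution + pooling over query and document, a hadamard
    product for matching, then dense layers that estimate y."""

    def __init__(self, num_ngraphs=2000, hidden=300):
        super().__init__()
        # Separate convolutional encoders for query and document.
        self.query_conv = nn.Sequential(
            nn.Conv1d(num_ngraphs, hidden, kernel_size=3), nn.Tanh(),
            nn.AdaptiveMaxPool1d(1),  # pool to a single latent vector
        )
        self.doc_conv = nn.Sequential(
            nn.Conv1d(num_ngraphs, hidden, kernel_size=3), nn.Tanh(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Fully-connected layers on top of the elementwise match.
        self.dense = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, query, doc):
        q = self.query_conv(query).squeeze(-1)  # (batch, hidden)
        d = self.doc_conv(doc).squeeze(-1)      # (batch, hidden)
        h = q * d                               # hadamard product
        return self.dense(h), h                 # score y and hidden state
```

Returning the hidden state alongside the score is a deliberate choice here: the next slides attach an adversarial discriminator to exactly that hidden state.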
11. The distributed sub-model of duet
Train the model on multiple domains
During training, an adversarial discriminator inspects the hidden states of the model and tries to predict the source corpus of the training sample
[Architecture diagram: as on the previous slide, with an adversarial discriminator (dense layers) attached to the hidden state, emitting a domain prediction 𝑧 alongside the relevance 𝑦]
The duet model, in addition to optimizing for the ranking loss, also tries to "fool" the adversarial discriminator, and in the process learns more domain independent representations
13. Additional regularization for the ranking loss
ℒ(q, d⁺, d⁻) = ℒ_rank(q, d⁺, d⁻; θ_rank) + λ · ℒ_adv(θ_adv)
where q is the query, d⁺ the relevant document, d⁻ the non-relevant document, θ_rank the parameters of the ranking model, and θ_adv the parameters of the adversarial discriminator
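One way to realize this combined objective in PyTorch, following the DANN-style formulation sketched above; λ and the hinge margin are illustrative choices, and `model` and `discriminator` are assumed to follow the shapes from the earlier sketch:

```python
import torch.nn.functional as F

def adversarially_regularized_loss(model, discriminator, query,
                                   d_pos, d_neg, domain_label, lam=0.1):
    """Sketch of the combined objective: a pairwise hinge ranking loss,
    plus the adversarial discriminator's domain-prediction loss on the
    model's hidden states. lam and the margin 1.0 are illustrative."""
    score_pos, hidden_pos = model(query, d_pos)
    score_neg, hidden_neg = model(query, d_neg)

    # Standard pairwise hinge loss for ranking.
    rank_loss = F.relu(1.0 - score_pos + score_neg).mean()

    # The discriminator tries to predict the source corpus of the
    # sample from the hidden state. (In practice the hidden state
    # passes through a gradient reversal layer first; next slide.)
    domain_logits = discriminator(hidden_pos)
    adv_loss = F.cross_entropy(domain_logits, domain_label)

    # With gradient reversal, minimizing this total loss trains the
    # discriminator while pushing the ranker to "fool" it.
    return rank_loss + lam * adv_loss
```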
15. Gradient reversal
Reverse the gradient from the discriminator when back-propagating through the ranking model
[Architecture diagram: as on the previous slides; the gradient from the adversarial discriminator 𝑧 is reversed before it flows back into the ranking model that estimates 𝑦]
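Gradient reversal has a compact, standard construction as a PyTorch autograd function; a minimal sketch (the surrounding names are illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -1 on
    the backward pass, so the ranking model is updated to *increase*
    the discriminator's loss while the discriminator itself is
    trained normally."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# Usage: insert between the ranker's hidden state and the discriminator.
# domain_logits = discriminator(GradReverse.apply(hidden))
```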
20. Optimizing Query Evaluations using Reinforcement Learning for Web Search
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra,
and Saurabh Tiwary
https://arxiv.org/abs/1804.04410
21. Large scale IR systems trade off search result quality against query response time
In Bing, we have a candidate generation stage followed by multiple rank and prune stages
Typically, we apply machine learning in the re-ranking stages
In this work, we explore reinforcement learning for effective and efficient candidate generation
22. In Bing, the index is distributed over multiple machines
For candidate generation, on each machine the documents are linearly scanned using a match plan
23. When a query comes in, it is automatically
categorized and a pre-defined match plan is
selected
A match plan consists of a sequence of
match rules, and corresponding stopping
criteria
A match rule defines the condition that
a document should satisfy to be selected as
a candidate
The stopping criteria decide when the index scan using a particular match rule should terminate, and whether the matching process should continue with the next match rule, conclude, or reset to the beginning of the index
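To make this structure concrete, here is one way the pieces could be represented in Python; the class and field names are hypothetical, not Bing's internal API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MatchRule:
    """Condition a document must satisfy to be selected as a candidate,
    e.g. 'all query terms appear in the title or URL metastreams'."""
    condition: Callable[[dict, list], bool]  # (document, query_terms) -> bool

@dataclass
class StoppingCriteria:
    """Thresholds on the accumulators; reaching either one ends the
    scan under the current match rule (see slide 26)."""
    max_blocks_accessed: int  # threshold on u
    max_term_matches: int     # threshold on v

@dataclass
class MatchPlan:
    """A sequence of match rules with corresponding stopping criteria."""
    rules: List[MatchRule]
    criteria: List[StoppingCriteria]
```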
24. Match plans influence the
trade-off between effectiveness
and efficiency
E.g., long queries with rare
intents may require expensive
match plans that consider body
text and search deeper into the
index
In contrast, for popular
navigational queries a shallow
scan against URL and title
metastreams may be sufficient
26. During execution, two accumulators are tracked
u: the number of blocks accessed from disk
v: the cumulative number of term matches in all inspected documents
A stopping criterion sets thresholds for each; when either threshold is met, the scan using that particular match rule terminates
Matching may then continue with a new match rule, terminate, or restart from the beginning of the index
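A minimal sketch of executing a single match rule under its stopping criteria, using the hypothetical structures sketched earlier; `count_term_matches` is an assumed helper, defined trivially here:

```python
def count_term_matches(doc, query_terms):
    """Hypothetical helper: number of query term occurrences in the
    document's indexed text."""
    return sum(doc.get("text", "").count(t) for t in query_terms)

def run_match_rule(index_blocks, query_terms, rule, stop):
    """Linearly scan the index under one match rule, tracking the two
    accumulators u (blocks accessed) and v (cumulative term matches).
    The scan ends as soon as either threshold is met."""
    candidates, u, v = [], 0, 0
    for block in index_blocks:
        u += 1  # one more block accessed from disk
        for doc in block:
            v += count_term_matches(doc, query_terms)
            if rule.condition(doc, query_terms):
                candidates.append(doc)
        if u >= stop.max_blocks_accessed or v >= stop.max_term_matches:
            break  # stopping criteria met for this match rule
    return candidates, u, v
```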
27. Typically these match plans are hand-crafted and
statically assigned to different query categories
In this work, we cast match planning as a
reinforcement learning task
30. Reinforcement learning (for Bing candidate generation)
Learn a policy π_θ : S → A which maximizes the cumulative discounted reward R = Σ_t γ^t · r_t, where γ is the discount rate
[Diagram: the agent observes the accumulators (u, v) and selects a match rule to run against the index; the reward is the relevance of the selected documents discounted by the index blocks accessed]
31. Reinforcement learning (for Bing candidate generation)
We use table-based Q-learning
State space: discrete ⟨u_t, v_t⟩
Action space: the choice of which match rule to execute next
[Diagram: as on the previous slide]
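A sketch of the table-based Q-learning machinery this implies; the bin widths, learning rate, discount rate, and epsilon are illustrative hyper-parameters, not the paper's settings:

```python
import random
from collections import defaultdict

def discretize(u, v, u_bin=100, v_bin=1000):
    """Map the raw accumulators to a discrete state <u_t, v_t>.
    Bin widths here are illustrative."""
    return (u // u_bin, v // v_bin)

# Q-table over discretized accumulator states; actions index match rules.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])

def epsilon_greedy(state, actions, eps=0.1):
    """Pick a match rule: explore with probability eps, else exploit."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```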
32. Reinforcement learning (for Bing candidate generation)
Reward function: g(d_i) is the relevance of the i-th document, estimated from the subsequent L1 ranker's score and considering only the top n documents; the per-step reward discounts this relevance by the number of index blocks accessed
[Diagram: as on the previous slide]
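The exact reward formula is not reproduced on the slide; the following sketch captures only the described shape (top-n relevance from the L1 ranker, discounted by blocks accessed) and should be read as an assumption about the form, not the paper's definition:

```python
def step_reward(new_docs, l1_score, blocks_accessed, n=10):
    """Illustrative per-step reward: summed relevance g(d_i) of the
    top-n newly selected documents (estimated via the downstream L1
    ranker score), discounted by the index blocks accessed. The exact
    discounting used in the paper may differ."""
    g = sorted((l1_score(d) for d in new_docs), reverse=True)[:n]
    return sum(g) / (1.0 + blocks_accessed)
```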
33. Reinforcement learning (for Bing candidate generation)
Final reward: if no new documents are selected, we assign a small negative reward
[Diagram: as on the previous slide]
35. Conclusions
Traditionally, ML models consume more time and resources to improve the quality of retrieved results
In this work, we argue that ML based approaches can also help improve response time
Milliseconds saved can translate into material cost savings in the query serving infrastructure, or can be re-purposed by upstream systems to provide a better end-user experience
36. THANK YOU!
Blog post: https://www.microsoft.com/en-us/research/blog/adversarial-and-reinforcement-learning-based-approaches-to-information-retrieval/
Editor's Notes
Clever Hans was a horse. It was claimed that he could do simple arithmetic. If you asked Hans a question he would respond by tapping his hoof. After a thorough investigation, it was, however, determined that what Clever Hans was really good at was reading very subtle and, in fact, unintentional cues that his trainer was giving him via his body language. Hans didn't know arithmetic at all. But he was very good at spotting body language that CORRELATED highly with the right answer.
We have just spoken about how latent matching models "sort of" memorize term relatedness or co-occurrences from the training data. So if you train such a model on, say, a recent news collection it may learn that the phrase "uk prime minister" is related to Theresa May. Now if you evaluate the same model on older TREC collections where a more meaningful association would have been with John Major, then your model performance may degrade.
This is problematic because what this means is that your model is “overfitting” to the distributions of your training data which may evolve over time or differ across collections. Phrasing it differently, your deep neural model has just very cleverly—like Hans the horse—learnt to depend on interesting correlations that do not generalize and may have ignored the more useful signals for actually modeling relevance.
This is an important problem. Think about an enterprise search solution that needs to cater to a large number of tenants. You train your model on only a few tenants—either because of privacy constraints or because most tenants are too small and you don’t have enough training data for the others. But afterwards you need to deploy the same model to all the tenants. Good cross domain performance would be key in such a setting.
How can we make these deep and large machine learning models—with all their lovely bells and whistles—as robust as a simple BM25 baseline?
A traditional IR model, such as BM25, makes very few assumptions about the target collection. You can argue that the inverse document frequencies (and a couple of the BM25 hyper-parameters) are all that you would learn from your collection. Which is why you can throw BM25 at most retrieval tasks (e.g., TREC or Web ranking in Bing) and it will give you pretty reasonable performance in most cases out-of-the-box. On the other hand, take a deep neural model and train it on the Bing Web ranking task and then evaluate it on TREC data, and I bet it falls flat on its face.
But the risk of memorizing correlations isn't limited to inferior performance. It also has strong ethical implications. Many of the real world collections we train on are naturally biased and encode a lot of our own unfortunate stereotypes. Here's an interesting paper from some of my colleagues at MSR pointing out how word embeddings may encode gender biases when trained on public collections such as the Google News dataset.