Deep Learning for Search
(Alternative title: “Neural Information Retrieval”)
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD Student, University College London
@UnderdogGeek bmitra@microsoft.com
Background image modified from source: https://commons.wikimedia.org/wiki/File:Howrah_Bridge_from_the_western_bank_of_the_Ganges.jpg
I am an applied researcher at Bing.
Based in Microsoft Research Montreal.
Previously worked for Microsoft in
Hyderabad, Seattle, and Cambridge.
Part-time PhD candidate at University
College London. My research is on
neural methods for information retrieval.
Born and raised in Kolkata.
(Diagram: works for Bing; based in MSR Montreal; used to be based in MSR Cambridge; doing PhD at UCL)
Neural Information
Retrieval (or neural IR)
is the application of
shallow or deep neural
networks to IR tasks.
THE STATE OF NEURAL INFORMATION RETRIEVAL
GROWING PUBLICATION POPULARITY
AT TOP IR CONFERENCES
STRONG PERFORMANCE AGAINST
TRADITIONAL METHODS IN TREC 2019
Download the slides:
http://bit.ly/dl4search-fire2019
Download the free book:
http://bit.ly/neuralir-intro
Download TREC Deep Learning Track data:
https://microsoft.github.io/TREC-2019-Deep-Learning/
@UnderdogGeek bmitra@microsoft.com
RESOURCES
AGENDA
Let’s focus on the fundamentals!
Please feel free to interrupt and
ask lots of questions!
THE SEARCH TASK
10 MINS
(10:05 AM - 10:15 AM)
INFORMATION RETRIEVAL (IR)
User has an information need
There exists a collection of information resources
IR is the activity of retrieving the information
resources relevant to the information need
EXAMPLE OF AN IR TASK
(WEB SEARCH)
User expresses information need as a short
textual query
The search engine retrieves top relevant web
documents as information resources
We will use web search as the main example of
an IR task in the rest of this lecture
query
Information
need
retrieval system indexes a
document corpus
results ranking (document list)
Relevance
(documents satisfy
information need)
DESIDERATA
Decades of IR research has
identified some key factors that text
retrieval models should consider
Traditional IR models typically
incorporate one or more of these
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
DESIDERATA
A document that contains more occurrences of the
query term(s) is more likely to be relevant
Tip: consider term frequency (TF)
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A rare term (e.g., “msmarco”) is likely to be more
informative than a common term (e.g., “and”)
Tip: consider inverse document frequency (IDF)
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
more informative than
DESIDERATA
A term should not contribute disproportionately
Increase in TF should have larger impact for smaller TFs
Tip: put a saturation function over the TF
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A document containing more non-relevant terms is
likely to be less relevant
Tip: perform document length normalization
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A document containing query terms in close proximity is
likely to be more relevant than one where the terms
occur far away from each other
Tip: consider proximity features
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
Term matches earlier in the document may indicate that the
document is more likely to be relevant
Tip: consider position of term matches
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
uk prime minister
The query and the document may refer to the same
concept using different vocabularies
Tip: consider expanding the query or document, or
matching query and document terms in a latent space
theresa may
DESIDERATA
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
albuquerque
By inspecting other terms in the document we may infer
if the document is about the query term
Tip: consider expanding the query or matching the
query terms with the document terms in a latent space
Passage about Albuquerque Passage not about Albuquerque
EXAMPLES OF RANKING METRICS
Discounted Cumulative Gain (DCG)
DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log_2(i + 1)
Reciprocal Rank (RR)
RR@k = max_{1 ≤ i ≤ k} rel_i / i
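A minimal sketch of both metrics; the graded relevance list rels is a toy input for illustration:

import math

def dcg_at_k(rels, k):
    # DCG@k = sum over the top-k positions of (2^rel_i - 1) / log2(i + 1)
    return sum((2 ** rel - 1) / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    # RR@k = max over the top-k positions of rel_i / i
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

print(dcg_at_k([3, 2, 0, 1], k=4), rr_at_k([0, 0, 1, 1], k=4))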
FUNDAMENTALS OF
NEURAL NETWORKS
30 MINS
(10:15 AM - 10:45 AM)
NEURAL
NETWORKS
Chains of parameterized linear transforms (e.g., multiply weight,
add bias) followed by non-linear functions (σ)
Popular choices for σ: Tanh, ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
(Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes the prediction, the backward pass propagates the loss computed against the expected output)
FUNDAMENTAL MACHINE LEARNING TASKS
SQUARED LOSS
The squared loss is a popular loss function for regression tasks
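In its standard form, for a target y and a prediction ŷ (typically averaged over the training samples):
ℒ_squared = (y − ŷ)²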
THE SOFTMAX FUNCTION
In neural classification models, the softmax function is popularly used to normalize
the neural network output scores across all the classes
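In its standard form, for output scores z_1, …, z_{|C|} over |C| classes:
p_i = e^{z_i} / Σ_{j=1}^{|C|} e^{z_j}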
CROSS ENTROPY
The cross entropy between two probability
distributions 𝑝 and 𝑞 over a discrete set of
events is given by,
If p_correct = 1 and p_i = 0 for all
other values of i, then
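The standard definitions for the general case and this one-hot case are:
H(p, q) = −Σ_i p_i · log(q_i)
H(p, q) = −log(q_correct)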
CROSS ENTROPY WITH
SOFTMAX LOSS
Cross entropy with softmax is a popular loss
function for classification
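Combining the two, in its standard form (with s_i denoting the model's score for class i):
ℒ_CE = −log( e^{s_correct} / Σ_i e^{s_i} )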
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = ∂l/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
STOCHASTIC GRADIENT DESCENT (SGD)
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = ∂(y − y2)²/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × ∂tanh(w2·y1 + b2)/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂tanh(w1·x + b1)/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
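To make the walkthrough concrete, here is a minimal plain-Python sketch of the same toy network trained with SGD; the initial parameter values, learning rate, and synthetic target function are assumptions for illustration.

import math, random

# Toy network from the walkthrough:
#   y1 = tanh(w1*x + b1),  y2 = tanh(w2*y1 + b2),  loss l = (y - y2)^2
w1, b1, w2, b2 = 0.1, 0.0, 0.1, 0.0   # assumed initial values
eta = 0.1                             # assumed learning rate

def sgd_step(x, y):
    global w1, b1, w2, b2
    # forward pass
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    l = (y - y2) ** 2
    # backward pass (chain rule, as expanded above)
    dl_dy2 = -2 * (y - y2)
    dy2_dz2 = 1 - y2 ** 2             # derivative of tanh(w2*y1 + b2)
    dy1_dz1 = 1 - y1 ** 2             # derivative of tanh(w1*x + b1)
    dl_dw2 = dl_dy2 * dy2_dz2 * y1
    dl_db2 = dl_dy2 * dy2_dz2
    dl_dw1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1 * x   # matches the final expansion above
    dl_db1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1
    # parameter updates with learning rate eta
    w1 -= eta * dl_dw1; b1 -= eta * dl_db1
    w2 -= eta * dl_dw2; b2 -= eta * dl_db2
    return l

# ...and repeat over (x, y) training pairs
for _ in range(1000):
    x = random.uniform(-1, 1)
    sgd_step(x, math.tanh(0.5 * x))   # toy target function (assumption)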
COMPUTATION
NETWORKS
The “Lego” approach to specifying DNN architectures
Library of computation nodes, each node defines logic for:
1. Forward pass: compute output given input
2. Backward pass: compute gradient of loss w.r.t. inputs,
given gradient of loss w.r.t. outputs
3. Parameter gradient: compute gradient of loss w.r.t.
parameters, given gradient of loss w.r.t. outputs
Chain nodes to create bigger and more complex networks
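A minimal sketch of the idea (not any particular toolkit's API): each node caches what it needs in the forward pass and exposes a backward pass plus parameter gradients.

import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01
        self.b = np.zeros(out_dim)
    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out    # parameter gradients, given grad of loss w.r.t. outputs
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T       # gradient of loss w.r.t. inputs (passed to previous node)

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1 - self.y ** 2)

# chaining nodes builds a bigger network: forward left-to-right, backward right-to-left
nodes = [Linear(4, 8), Tanh(), Linear(8, 1)]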
TOOLKITS
A diverse set of options
to choose from!
Figure from https://towardsdatascience.com/battle-of-
the-deep-learning-frameworks-part-i-cff0e3841750
TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH
First, we define the model
architecture
Next, we specify loss function and
optimization algorithm
Finally, loop over training data to
optimize model parameters
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
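A condensed sketch in the spirit of the linked tutorial; the layer sizes follow the CIFAR-10 example and the trainloader data loader is assumed to be defined as in the tutorial.

import torch
import torch.nn as nn
import torch.nn.functional as F

# First, define the model architecture
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = Net()

# Next, specify the loss function and the optimization algorithm
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Finally, loop over the training data to optimize the model parameters
for epoch in range(2):
    for inputs, labels in trainloader:   # trainloader assumed defined as in the tutorial
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()                  # backpropagate
        optimizer.step()                 # update parameters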
REALLY DEEP
NEURAL NETWORKS
(Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
WHY ADDING DEPTH HELPS
http://playground.tensorflow.org
can’t separate using a linear model!
Input features (surface, kerberos, book, library) → Label
1 0 1 0 → ✓
1 1 0 0 → ✗
0 1 0 1 → ✓
0 0 1 1 → ✗
(Diagram: a tiny network with two hidden units, H1 and H2, connected to the inputs {surface, kerberos, book, library} with weights of +1 and −1, and to the output with weights of +0.5)
But let’s consider a tiny neural
network with one hidden layer…
VISUAL
MOTIVATION FOR
HIDDEN UNITS
Consider the following “toy” challenge for
classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
VISUAL
MOTIVATION FOR
HIDDEN UNITS
Or more succinctly…
Input features (surface, kerberos, book, library) | Hidden layer (H1, H2) → Label
1 0 1 0 | 1 0 → ✓
1 1 0 0 | 0 0 → ✗
0 1 0 1 | 0 1 → ✓
0 0 1 1 | 0 0 → ✗
(Diagram: a tiny network with two hidden units, H1 and H2, connected to the inputs {surface, kerberos, book, library} with weights of +1 and −1, and to the output with weights of +0.5)
But let’s consider a tiny neural
network with one hidden layer…
can separate using a linear model!
Consider the following “toy” challenge for
classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
WHY ADDING DEPTH HELPS
Deeper networks can split the input space
into many more (non-independent) linear regions
than shallow networks
Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks NIPS 2014
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
THE LOTTERY
TICKET HYPOTHESIS
BIAS-VARIANCE TRADE-OFF IN THE
DEEP LEARNING ERA
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
LEARNING TO RANK
35 MINS
(10:45 AM - 11:20 AM)
MOST IR SYSTEMS PRESENT
RANKED LISTS OF RETRIEVED
INFORMATION ARTIFACTS
THE UNREASONABLE EFFECTIVENESS
OF SIMPLE LTR BASED APPROACHES
LEARNING TO
RANK (LTR)
”... the task to automatically construct a ranking
model using training data, such that the model
can sort new objects according to their degrees
of relevance, preference, or importance.”
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
LEARNING TO
RANK (LTR)
L2R models represent a rankable item—e.g.,
a document—given some context—e.g., a
user-issued query—as a numerical vector
x ∈ ℝ^n
The ranking model 𝑓: 𝑥 → ℝ is trained to
map the vector to a real-valued score such
that relevant items are scored higher.
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
WHY IS RANKING CHALLENGING?
Rank based metrics, such as DCG or MRR, are
non-smooth / non-differentiable
APPROACHES
Pointwise approach
Relevance label 𝑦 𝑞,𝑑 is a number—derived from binary or graded human
judgments or implicit user feedback (e.g., CTR). Typically, a regression or
classification model is trained to predict 𝑦 𝑞,𝑑 given 𝑥 𝑞,𝑑.
Pairwise approach
Pairwise preference between documents for a query (𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞) as
label. Reduces to binary classification to predict more relevant document.
Listwise approach
Directly optimize for rank-based metric, such as NDCG—difficult because
these metrics are often not differentiable w.r.t. model parameters.
Liu [2009] categorizes
different LTR approaches
based on training objectives:
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
FEATURES
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
Traditional L2R models employ
hand-crafted features that
encode IR insights
FEATURES
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
POINTWISE
OBJECTIVES
Regression loss
Given 𝑞, 𝑑 predict the value of 𝑦 𝑞,𝑑
e.g., square loss for binary or categorical
labels,
where, 𝑦 𝑞,𝑑 is the one-hot representation
[Fuhr, 1989] or the actual value [Cossock and
Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
POINTWISE
OBJECTIVES
Classification loss
Given 𝑞, 𝑑 predict the class 𝑦 𝑞,𝑑
e.g., cross-entropy with softmax over
categorical labels 𝑌 [Li et al., 2008],
where, 𝑠 𝑦 𝑞,𝑑
is the model’s score for label 𝑦 𝑞,𝑑
Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
PAIRWISE
OBJECTIVES Pairwise loss generally has the following form [Chen et al., 2009],
where, 𝜙 can be,
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^{−z} [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^{−z}) [Burges et al., 2005]
• Others…
Pairwise loss minimizes the average number of
inversions in ranking—i.e., 𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞 but 𝑑𝑗 is
ranked higher than 𝑑𝑖
Given 𝑞, 𝑑𝑖, 𝑑𝑗 , predict the more relevant document
For 𝑞, 𝑑𝑖 and 𝑞, 𝑑𝑗 ,
Feature vectors: 𝑥𝑖 and 𝑥𝑗
Model scores: 𝑠𝑖 = 𝑓 𝑥𝑖 and 𝑠𝑗 = 𝑓 𝑥𝑗
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
PAIRWISE
OBJECTIVES
RankNet loss
Pairwise loss function proposed by Burges et al. [2005]—an industry favourite
[Burges, 2015]
Predicted probabilities: p̂_ij = p(s_i > s_j) ≡ e^{γ·s_i} / (e^{γ·s_i} + e^{γ·s_j}) = 1 / (1 + e^{−γ·(s_i − s_j)})
Desired probabilities: p_ij = 1 and p_ji = 0
Computing cross-entropy between p and p̂,
ℒ_RankNet = −p_ij·log(p̂_ij) − p_ji·log(p̂_ji) = −log(p̂_ij) = log(1 + e^{−γ·(s_i − s_j)})
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
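A minimal sketch of the RankNet loss (γ = 1 by default; the score tensors here are toy values, and the model producing s_i and s_j is assumed):

import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    # -log p̂_ij = log(1 + exp(-gamma * (s_i - s_j))), i.e. softplus of the negated score gap
    return F.softplus(-gamma * (s_i - s_j))

# usage: s_i and s_j are model scores for the more and the less relevant document
s_i = torch.tensor([2.1]); s_j = torch.tensor([0.3])
loss = ranknet_loss(s_i, s_j).mean()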
A GENERALIZED CROSS-ENTROPY LOSS
An alternative loss function assumes a single relevant document 𝑑+ and compares it
against the full collection 𝐷
Predicted probabilities: p(d+ | q) = e^{γ·s(q, d+)} / Σ_{d∈D} e^{γ·s(q, d)}
The cross-entropy loss is then given by,
ℒ_CE(q, d+, D) = −log(p(d+ | q)) = −log( e^{γ·s(q, d+)} / Σ_{d∈D} e^{γ·s(q, d)} )
Computing the softmax over the full collection is prohibitively expensive—LTR models
typically consider few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
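A minimal sketch of this loss with a handful of sampled negatives; the batch size, number of negatives, and random scores are placeholders for model outputs:

import torch
import torch.nn.functional as F

def ce_with_sampled_negatives(scores):
    # scores: [batch, 1 + num_negatives]; column 0 holds the score of the relevant document d+
    labels = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores, labels)   # equals -log softmax(scores)[:, 0], averaged over the batch

scores = torch.randn(8, 5)                   # 8 queries, 1 positive + 4 sampled negatives each
loss = ce_with_sampled_negatives(scores)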
Blue: relevant Gray: non-relevant
NDCG and ERR are higher for the left ranking,
but the right ranking has fewer pairwise errors
Due to strong position-based discounting in
IR measures, errors at higher ranks are much
more problematic than at lower ranks
But listwise metrics are non-continuous and
non-differentiable
LISTWISE
OBJECTIVES
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
[Burges, 2010]
LISTWISE
OBJECTIVES
Burges et al. [2006] make two observations:
1. To train a model we don’t need the costs
themselves, only the gradients (of the costs
w.r.t model scores)
2. It is desirable that the gradient be bigger for
pairs of documents whose swap produces a bigger
change in NDCG
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
LambdaRank loss
Multiply actual gradients with the change in
NDCG by swapping the rank positions of the
two documents
LISTWISE
OBJECTIVES
According to the Luce model [Luce, 2005], given
four items 𝑑1, 𝑑2, 𝑑3, 𝑑4 the probability of
observing a particular rank-order, say
𝑑2, 𝑑1, 𝑑4, 𝑑3 , is given by:
where, 𝜋 is a particular permutation and 𝜙 is a
transformation (e.g., linear, exponential, or
sigmoid) over the score 𝑠𝑖 corresponding to item
𝑑𝑖
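Written out in its standard form, the probability for this example is the product:
p(⟨d_2, d_1, d_4, d_3⟩) = φ(s_2) / (φ(s_1) + φ(s_2) + φ(s_3) + φ(s_4)) × φ(s_1) / (φ(s_1) + φ(s_3) + φ(s_4)) × φ(s_4) / (φ(s_3) + φ(s_4))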
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
ListNet loss
Cao et al. [2007] propose to compute the
probability distribution over all possible
permutations based on model score and ground-
truth labels. The loss is then given by the K-L
divergence between these two distributions.
This is computationally very costly, computing
permutations of only the top-K items makes it
slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the
probability of the ideal permutation based on the
ground truth. However, with categorical labels
more than one permutation is possible.
LISTWISE
OBJECTIVES
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of
documents as a function of their scores
This “smooth” rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
BREAK
10 MINS
(11:20 AM - 11:30 AM)
EMBEDDINGS
45 MINS
(11:30 AM - 12:15 PM)
TYPES OF VECTOR REPRESENTATIONS
Local (or one-hot) representation
Every term in vocabulary T is represented by a
binary vector of length |T|, where one position in
the vector is set to one and the rest to zero
Distributed representation
Every term in vocabulary T is represented by a
real-valued vector of length k. The vector can be
sparse or dense. The vector dimensions may be
observed (e.g., hand-crafted features) or latent
(e.g., embedding dimensions).
Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
OBSERVED (OR EXPLICIT)
DISTRIBUTED
REPRESENTATIONS
The choice of features is a key consideration
The distributional hypothesis states that
terms that are used (or occur) in similar
context tend to be semantically similar
[Harris, 1954]
Firth [1957] famously purported this idea of
distributional semantics by stating “a word is
characterized by the company it keeps”.
Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford.
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
MINOR NOTE: SPOT THE DIFFERENCE!
DISTRIBUTED REPRESENTATION
Vector representations of items as
combinations of different features
or dimensions (as opposed to
one-hot)
DISTRIBUTIONAL SEMANTICS
Linguistic items with similar
distributions (e.g. context words)
have similar meanings
http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
EXAMPLE: TERM-CONTEXT VECTOR SPACE
T: vocabulary, C: set of contexts, S: sparse matrix |T| x |C|
(PPMI: Positive Pointwise Mutual Information)
(Matrix S: entry S_ij is the PPMI value for term t_i and context c_j)
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010
EXAMPLE: SALTON’S VECTOR SPACE
D: collection, T: vocabulary, S: sparse matrix |D| x |T|
(Matrix S: entry S_ij is the TF-IDF weight of term t_j in document d_i)
G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, Nov. 1975
NOTIONS OF
SIMILARITY
Two terms are similar if their feature
vectors are close
But different feature spaces may capture
different notions of similarity
Is Seattle more similar to…
Sydney (similar type)
or
Seahawks (similar topic)
Depends on your choice of features
NOTIONS OF
SIMILARITY
Consider the following toy corpus…
Now consider the different vector
representations of terms you can derive
from this corpus and how the items that
are similar differ in these vector spaces
NOTIONS OF
SIMILARITY
Topical or Syntagmatic similarity
NOTIONS OF
SIMILARITY
Typical or Paradigmatic similarity
NOTIONS OF
SIMILARITY
A mix of Topical and Typical similarity
NOTIONS OF
SIMILARITY
Consider the following toy corpus…
Now consider the different vector
representations of terms you can derive
from this corpus and how the items that
are similar differ in these vector spaces
RETRIEVAL USING VECTOR REPRESENTATIONS
Map both query and candidate documents
into the same vector space
Retrieve documents closest to the query
e.g., using Salton’s vector space model
Where, 𝑣 𝑞 and 𝑣 𝑑 are vectors of TF-IDF scores
over all terms in the vocabulary
G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, Nov. 1975
sim(q, d) = (v_q · v_d) / (‖v_q‖ · ‖v_d‖)
REGULARITIES IN OBSERVED FEATURE SPACES
Some feature spaces capture
interesting linguistic regularities
e.g., simple vector algebra in the
term-neighboring term space may
be useful for word analogy tasks
Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
EMBEDDINGS
An embedding is a representation of items in
a new space such that the properties of, and
the relationships between, the items are
preserved from the original representation.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
EMBEDDINGS
e.g., 200-dimensional term embedding for “banana”
EMBEDDINGS
Compared to observed feature spaces:
• Embeddings typically have fewer dimensions
• The space may have more disentangled principle
components
• The dimensions may be less interpretable
• The latent representations may generalize better
What’s the advantage of
latent vector spaces over
observed features spaces?
LET’S TAKE AN IR
EXAMPLE
In Salton’s vector space, both
these passages are equidistant
from the query “Albuquerque”
A latent feature representation
may put the first passage closer to
the query because of terms like
“population” and “area”
Passage about Albuquerque
Passage not about Albuquerque
Query: “Albuquerque”
HOW TO LEARN TERM EMBEDDINGS?
Multiple approaches have been
proposed for learning embeddings
from <term, context, count> data
Popular approaches include matrix
factorization or stochastic gradient
descent (SGD)
(Term-context matrix X: entry X_ij is the count for term t_i and context c_j)
LATENT SEMANTIC ANALYSIS (LSA)
Perform SVD on X to obtain
its low-rank approximation
Involves finding a solution to X = UΣVᵀ
The embedding for the i-th term is given by Σ_k·t_i
Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990.
WORD2VEC
Goal: simple (shallow) neural model
learning from billion words scale corpus
Predict middle word from neighbors
within a fixed size context window
Two different architectures:
1. Skip-gram
2. CBOW
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
SKIP-GRAM
Predict neighbor 𝑡𝑖+𝑗 given term 𝑡𝑖
THE SKIP-GRAM LOSS
S is the set of all windows over the training text
c is the number of neighbours we need to predict on either side of the term 𝑡𝑖
Full softmax is computationally impractical - hierarchical softmax or negative sampling is employed instead
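Written out in its standard form (the softmax referred to above is the one over the vocabulary T):
ℒ_skip-gram = −(1/|S|) Σ_{i∈S} Σ_{−c ≤ j ≤ +c, j ≠ 0} log p(t_{i+j} | t_i)
where p(t_{i+j} | t_i) is a softmax over the vocabulary, computed from the dot product between the IN embedding of t_i and the OUT embedding of t_{i+j}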
CONTINUOUS
BAG-OF-WORDS
(CBOW)
Predict the middle term 𝑡𝑖 given
{𝑡𝑖−𝑐, … , 𝑡𝑖−1, 𝑡𝑖+1, … , 𝑡𝑖+𝑐}
THE CBOW LOSS
Note: from every window of text skip-gram generates 2 x c training samples
whereas CBOW generates one – that’s why CBOW trains faster than skip-gram
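A standard way to write the CBOW loss (the context is typically represented by the sum or average of the IN embeddings of the neighboring terms):
ℒ_CBOW = −(1/|S|) Σ_{i∈S} log p(t_i | t_{i−c}, …, t_{i−1}, t_{i+1}, …, t_{i+c})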
WORD ANALOGIES
WITH WORD2VEC
W2v is popular for word analogy tasks
But remember the same relationships also
exist in the observed feature space, as we
saw earlier
A MATRIX INTERPRETATION OF WORD2VEC
Let x_ij be the frequency of the pair (t_i, t_j) in the training data; the word2vec loss can then be interpreted as a cross-entropy error between the actual and the predicted co-occurrence probabilities.
Replace the cross-entropy error
with a squared-error and apply a
saturation function f(…) over 𝑥𝑖𝑗
GLOVE
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
ℒ_GloVe = Σ_{i=1}^{|T|} Σ_{j=1}^{|T|} f(x_{i,j}) · (log(x_{i,j}) − w_i^⊺·w̄_j)²
where log(x_{i,j}) is the actual co-occurrence, w_i^⊺·w̄_j is the predicted co-occurrence, f is a saturation function over the actual co-occurrence count, and the term inside the sum is a squared error between the two
PARAGRAPH2VEC
W2v style model where context is
document, not neighboring term
Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
RECAP: HOW TO LEARN TERM EMBEDDINGS?
Learn from <term, context, count> data
Choice of context (e.g., neighboring term or container document) defines what
relationship you are modeling
Choice of learning algorithm (e.g., matrix factorization or SGD) defines
how well you model the relationship
Choice of context and learning algorithm are independent – you can use
matrix factorization with neighboring term context, or a w2v-style neural
network with document context (e.g., paragraph2vec)
EXAMPLES OF TEXT EMBEDDINGS
For each model: embedding for | source item | target item | learning model
• Latent Semantic Analysis, Deerwester et al. (1990): single word | word (one-hot) | document (one-hot) | matrix factorization
• Word2vec, Mikolov et al. (2013): single word | word (one-hot) | neighboring word (one-hot) | neural network (shallow)
• GloVe, Pennington et al. (2014): single word | word (one-hot) | neighboring word (one-hot) | matrix factorization
• Semantic Hashing (auto-encoder), Salakhutdinov and Hinton (2007): multi-word text | document (bag-of-words) | same as source (bag-of-words) | neural network (deep)
• DSSM, Huang et al. (2013), Shen et al. (2014): multi-word text | query text (bag-of-trigrams) | document title (bag-of-trigrams) | neural network (deep)
• Session DSSM, Mitra (2015): multi-word text | query text (bag-of-trigrams) | next query in session (bag-of-trigrams) | neural network (deep)
• Language Model DSSM, Mitra and Craswell (2015): multi-word text | query prefix (bag-of-trigrams) | query suffix (bag-of-trigrams) | neural network (deep)
DEEP NEURAL
NETWORKS
45 MINS
(12:15 PM - 13:00 PM)
DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
LET’S TALK (BRIEFLY) ABOUT
SUPERVISION FOR LEARNING
TEXT EMBEDDINGS WITH DNNS
Supervised approach
Ideal if sufficiently labeled training data is available for the target
retrieval task
Unsupervised approach
E.g., training an auto-encoder or a language model on unlabeled
corpus
Hybrid approach
Current state-of-the-art results have employed large-scale
unsupervised pretraining—followed by sufficiently large-scale
supervised fine-tuning towards the target task
SIAMESE NETWORK
Supervised model trained on 𝑞, 𝑑1, 𝑑2 where 𝑑1is relevant to
q, but 𝑑2 is non-relevant
Logistic loss is popularly used—think RankNet where
𝑠𝑖𝑚 𝑣 𝑞, 𝑣 𝑑 is the model score
Typically both left and right models share similar architectures,
but may also choose to share the learnable parameters
AUTOENCODER
Unsupervised models trained to minimize
reconstruction errors
Information Bottleneck method (Tishby et al., 1999)
The bottleneck layer 𝑥 captures “minimal sufficient
statistics” of 𝑣 and is a compressed representation of
the same
LANGUAGE MODELING
A family of language modeling tasks have been
explored in the literature, including:
• Predict next word in a sequence
• Predict masked word in a sequence
• Predict next sentence
Fundamentally the same idea as word2vec and older
neural LMs—but with deeper models and considering
dependencies across longer distances between terms
(Diagram: given the input sequence w1, w2, [MASK], w4, the model predicts the masked term and the loss is computed against w3)
SHIFT-INVARIANT
NEURAL OPERATIONS
Detecting a pattern in one part of the input space is similar to
detecting it in another
Leverage redundancy by moving a window over the whole
input space and then aggregate
On each instance of the window a kernel—also known as a
filter or a cell—is applied
Different aggregation strategies lead to different architectures
CONVOLUTION
Move the window over the input space each time applying the
same cell over the window
A typical cell operation can be,
h = σ(W·X + b)
Full Input [words x in_channels]
Cell Input [window x in_channels]
Cell Output [1 x out_channels]
Full Output [(1 + (words – window) / stride) x out_channels]
POOLING
Move the window over the input space, each time applying an
aggregate function over each dimension within the window
h_j = max_{i∈win}(X_{i,j}) or h_j = avg_{i∈win}(X_{i,j})
Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [(1 + (words – window) / stride) x channels]
max-pooling, average-pooling
CONVOLUTION W/
GLOBAL POOLING
Stacking a global pooling layer on top of a convolutional layer
is a common strategy for generating a fixed length embedding
for a variable length text
Full Input [words x in_channels]
Full Output [1 x out_channels]
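A minimal sketch of convolution with global max-pooling over a word-embedding sequence; the dimensions are arbitrary choices for illustration:

import torch
import torch.nn as nn

words, in_channels, out_channels, window = 12, 300, 128, 3
x = torch.randn(1, in_channels, words)             # [batch, in_channels, words]
conv = nn.Conv1d(in_channels, out_channels, kernel_size=window)
h = torch.relu(conv(x))                            # [1, out_channels, words - window + 1]
text_embedding = h.max(dim=2).values               # global max-pooling -> [1, out_channels]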
RECURRENCE
Similar to a convolution layer but additional dependency on
previous hidden state
A simple cell operation shown below but others like LSTM and
GRUs are more popular in practice,
h_i = σ(W·X_i + U·h_{i−1} + b)
Full Input [words x in_channels]
Cell Input [window x in_channels] + [1 x out_channels]
Cell Output [1 x out_channels]
Full Output [1 x out_channels]
RECURSIVE OR TREE-
RNN
Shared weights among all the levels of the tree
Cell can be an LSTM or as simple as
h = σ(W·X + b)
Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [1 x channels]
ATTENTION
Given a set of n items and an input context, produce a
probability distribution {a1, …, ai, …, an} of attending to each item
as a function of similarity between a learned representation (q)
of the context and learned representations (ki) of the items
a_i = φ(q, k_i) / Σ_{j=1}^{n} φ(q, k_j)
The aggregated output is given by Σ_{i=1}^{n} a_i · v_i
Full Input [words x in_channels], [1 x ctx_channels]
Full Output [1 x out_channels]
* When attending over a sequence (and not a set), the key k and value
v are typically a function of the item and some encoding of the position
SELF ATTENTION
Given a sequence (or set) of n items, treat each item as the
context at a time and attend over the whole sequence (or set),
and repeat for all n items
Full Input [words x in_channels]
Full Output [words x out_channels]
RESIDUAL NETWORKS
Enabled training of really
deep architectures (up to
152 layers)
Each layer learns the
residual functions with
reference to the layer inputs
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
TRANSFORMERS
A transformer layer consists of a combination of a self-attention
layer and multiple fully-connected or convolutional layers, with
residual connections
A transformer-based encoder can consist of multiple
transformers stacked in sequence
Full Input [words x in_channels]
Full Output [words x out_channels]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
NORMALIZATION
Internal covariate shift refers to the
changing distribution of each layer’s
inputs during training, as the parameters
of the previous layers change
BatchNorm and other normalization
techniques achieve better training
effectiveness by addressing this problem
Sergey Ioffe, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Image source: https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/
REGULARIZATION
The process of adding information
in order to prevent overfitting.
Popular approaches:
• Dropout
• Regularization loss
• Early stopping
CONTEXTUALIZED
DEEP WORD
EMBEDDINGS
http://jalammar.github.io/illustrated-bert/
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
BERT
Stacked transformer layers
Pretrained on two tasks:
• Masked language modeling
• Next sentence prediction
Input: WordPiece embedding +
position embedding + segment
embedding
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
LUNCH
60 MINS
(13:00 PM - 14:00 PM)
SHALLOW NEURAL
METHODS FOR RANKING
50 MINS
(14:00 PM - 14:50 PM)
THERE IS A LONG HISTORY OF
VECTOR SPACE MODELS (BOTH
DENSE AND SPARSE) IN IR
Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. In CACM, 1975.
Scott Deerwester, et. al. Indexing by latent semantic analysis. In JASIST, 1990.
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
RETRIEVAL USING VECTOR REPRESENTATIONS
Generate vector
representation of query
Generate vector
representation of document
Estimate relevance from q-d
vectors
Compare query and document
directly in the embedding space
POPULAR APPROACHES TO INCORPORATING
TERM EMBEDDINGS FOR MATCHING
Use embeddings to generate
suitable query expansions
estimate relevance estimate relevance
E.g.,
Generalized Language Model [Ganguly et
al., 2015]
Neural Translation Language Model
[Zuccon et al., 2015]
Average term embeddings [Le and Mikolov,
2014, Nalisnick et al., 2016, Zamani and Croft, 2016,
and others]
Word mover’s distance [Kusner et al., 2015,
Guo et al., 2016]
Compare query and document
directly in the embedding space
estimate relevance
GENERALIZED LANGUAGE MODEL
Traditional language modeling based IR approach may estimate q-d relevance as follows,
where, 𝑝 𝑡 𝑞|𝑑 is the
probability of generating
term 𝑡 𝑞 from document 𝑑
p(t_q | d) and p(t_q | D) are the probabilities of randomly sampling term t_q from document d and the full collection D, respectively
p(t_q | D) has a smoothing effect on the p(t_q | d) estimation
GENERALIZED LANGUAGE MODEL
GLM includes additional smoothing based on term similarity in the embedding space
Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
Probability of generating the
term from the document
based on similarity in the
embedding space
Probability of generating the term
from the full collection based on
similarity in the embedding space
NEURAL TRANSLATION LANGUAGE MODEL
Translation Language Model:
Neural Translation Language Model:
TLM estimates p(t_q | t_d) from q-d paired data,
similar to statistical machine translation
NTLM uses term-term similarity in the
embedding space to estimate p(t_q | t_d)
Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
AVERAGE TERM EMBEDDINGS
Q-D relevance
estimated by
computing cosine
similarity between
centroid of q and d
term embeddings
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
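A minimal sketch of this estimate; the random matrices stand in for learned term embeddings:

import torch
import torch.nn.functional as F

q_emb = torch.randn(3, 200)      # embeddings of the 3 query terms
d_emb = torch.randn(50, 200)     # embeddings of the 50 document terms
# cosine similarity between the centroids of the query and document term embeddings
score = F.cosine_similarity(q_emb.mean(dim=0), d_emb.mean(dim=0), dim=0)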
WORD MOVER’S DISTANCE
Based on the Earth Mover’s Distance (EMD)
[Rubner et al., 1998]
Originally proposed by Wan et al. [2005, 2007],
but used WordNet and topic categories
Kusner et al. [2015] incorporated term
embeddings
Adapted for q-d matching by Guo et al. [2016]
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In CV, 1998.
Xiaojun Wan and Yuxin Peng. The earth mover’s distance as a semantic measure for document similarity. In CIKM, 2005.
Xiaojun Wan. A novel document similarity measure based on earth mover’s distance. Information Sciences, 2007.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, 2016.
CHOICE OF TERM EMBEDDINGS
FOR DOCUMENT RANKING
RECAP: for the query “Albuquerque” the relevant
document may contain terms like “population” and “area”
Documents about “Santa Fe” not relevant for this query
“Albuquerque” ↔ “population” (Topically similar) ✓
“Albuquerque” ↔ “Santa Fe” (Typically similar) ✗
Standard LSA and para2vec capture topical similarity,
whereas w2v and GloVe capture a mix of both Top/Typ-ical
Passage about Albuquerque
Passage not about Albuquerque
Query: “Albuquerque”
DUAL EMBEDDING SPACE MODEL
What if I told you that everyone
using word2vec is throwing half
the model away?
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
IN-OUT captures a more
Topical notion of similarity
than IN-IN and OUT-OUT
Effect is exaggerated
when embeddings are
trained on short text (e.g.,
queries)
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
Average term embeddings model, but use IN embeddings for
query terms and OUT embeddings for document terms
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
IN+OUT Embeddings for 2.7M words
trained on 600M+ Bing queries
https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e
Download
RELEVANCE-BASED
WORD EMBEDDING
Goal: learn a word embedding that directly
models a topical notion of similarity
Given query q, predict term t sampled from
a smoothed language model (estimated
using PRF) for the query
Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In SIGIR, 2017.
A TALE OF TWO QUERIES
“PEKAROVIC LAND COMPANY”
Hard to learn good representation for
the rare term pekarovic
But easy to estimate relevance based on
count of exact term matches of
pekarovic in the document
“WHAT CHANNEL ARE THE
SEAHAWKS ON TODAY”
Target document likely contains ESPN
or sky sports instead of channel
The terms ESPN and channel can be
compared in a term embedding space
Matching in the term space is necessary to handle rare terms. Matching in the
latent embedding space can provide additional evidence of relevance. Best
performance is often achieved by combining matching in both vector spaces.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
Besides the term “Cambridge”, other related terms (e.g., “university”, “town”,
“population”, and “England”) contribute to the relevance of the passage
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
However, the same terms may also make a passage about Oxford look somewhat
relevant to the query “Cambridge”
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
A passage about giraffes, however, obviously looks non-relevant in the embedding
space…
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
But the embedding based matching model is more robust to the same passage when "giraffe" is
replaced by "Cambridge"—a trick that would fool exact term based IR models. In a sense, the
embedding based model ranks this passage low because Cambridge is not "an African
even-toed ungulate mammal".
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
E.g.,
Generalized Language Model [Ganguly et
al., 2015]
Neural Translation Language Model
[Zuccon et al., 2015]
Average term embeddings [Le and Mikolov,
2014, Nalisnick et al., 2016, Zamani and Croft, 2016,
and others]
Word mover’s distance [Kusner et al., 2015,
Guo et al., 2016]
Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Hamed Zamani and W Bruce Croft. Estimating embedding vectors for queries. In ICTIR, 2016.
Compare query and document
directly in the embedding space
estimate relevance
Compare query and document
directly in the embedding space
POPULAR APPROACHES TO INCORPORATING
TERM EMBEDDINGS FOR MATCHING
Use embeddings to generate
suitable query expansions
estimate relevance estimate relevance
QUERY EXPANSION USING
TERM EMBEDDINGS
Use embeddings to generate
suitable query expansions
estimate relevance
Find good expansion terms based on nearness in
the embedding space
Better retrieval performance when combined
with pseudo-relevance feedback (PRF) [Zamani and
Croft, 2016] and if we learn query specific term
embeddings [Diaz et al., 2016]
Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016.
Hamed Zamani and W Bruce Croft. Embedding-based query language models. In ICTIR, 2016.
QUERY EXPANSION
Language model based IR: score(d, q) = KL(θ_q, θ_d), where θ_q is the query language model and θ_d is the document language model
Query expansion: θ̂_q = U·Uᵀ·θ_q
Query expansion using PRF: U is the |v| × |D| term-document matrix
Query expansion using word embeddings: U is the |v| × k term embedding matrix
“Word2vec is the sriracha sauce of deep learning!”
BUT A GOOD
CHEF…
Would prepare the
appropriate sauce for
each dish.
GLOBAL VS. LOCAL
EMBEDDINGS
Nearest neighbors of the word "cut" (as in "gas cut") in the embedding space:
• Local: tax, deficit, vote, budget, reduction, house, bill, plan, spend, billion
• Global: cutting, squeeze, reduce, slash, reduction, spend, lower, halve, soften, freeze
U_global: embedding trained with P(w|C), where C is the whole document corpus
U_local: embedding trained with P(w|R), where R is the set of relevant documents only
QUERY EXPANSION USING
GLOBAL AND LOCAL WORD
EMBEDDINGS
Each point represents a candidate expansion term
Red points have high frequency in the relevant set
of documents
White points have low or no frequency in the
relevant set of documents
The blue point represents the query.
Contours indicate distance from the query
global
local
Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
DEEP NEURAL METHODS
FOR RANKING
90 MINS
(14:50 PM - 16:20 PM)
SEMANTIC
HASHING
Document autoencoder minimizing
reconstruction error
Input: word counts (vocab size = 2K)
Output: binary vector
Stacked RBMs w/ layer-by-layer pre-training
followed by E2E tuning
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
DEEP SEMANTIC
SIMILARITY
MODEL (DSSM)
Siamese network trained E2E on query and
document title pairs
Relevance is estimated by cosine similarity between
query and document embeddings
Input: character trigraph counts (bag of words
assumption)
Minimizes cross-entropy loss against randomly
sampled negative documents
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
CONVOLUTIONAL
DSSM (CDSSM)
Replace bag-of-words assumption by concatenating
term vectors in a sequence on the input
Convolution followed by global max-pooling
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
REMEMBER…
…how different embedding
spaces capture different
notions of similarity?
DSSM TRAINED ON DIFFERENT TYPES OF DATA
Trained on pairs of… (sample training data → useful for → paper):
• Query and document titles: <"things to do in seattle", "seattle tourist attractions"> → Document ranking → (Shen et al., 2014) https://dl.acm.org/citation...
• Query prefix and suffix: <"things to do in", "seattle"> → Query auto-completion → (Mitra and Craswell, 2015) https://dl.acm.org/citation...
• Consecutive queries in user sessions: <"things to do in seattle", "space needle"> → Next query suggestion → (Mitra, 2015) https://dl.acm.org/citation...
Each model captures a different notion of similarity
(or regularity) in the learnt embedding space
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
Nearest neighbors for “seattle” and “taylor swift” based on two DSSM models
– one trained on query-document pairs and the other trained on query
prefix-suffix pairs
DIFFERENT REGULARITIES IN DIFFERENT
EMBEDDING SPACES
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
DIFFERENT REGULARITIES IN DIFFERENT
EMBEDDING SPACES
Groups of similar search intent
transitions from a query log
The DSSM trained on session query pairs
can capture regularities in the query space
(similar to word2vec for terms)
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
DSSM TRAINED ON SESSION QUERY PAIRS
ALLOWS FOR ANALOGIES OVER SHORT TEXT!
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
INTERACTION-BASED
NETWORKS
Typically a document is relevant if some part of the
document contains information relevant to the query
Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing
the ith window over query terms with the jth window over the
document terms—captures evidence of relevance from
different parts of the document
Additional neural network layers can inspect the interaction
matrix and aggregate the evidence to estimate overall
relevance
Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
REMEMBER…
…the importance of
incorporating exact term
matches, as well as matches
in the latent space, when
estimating relevance?
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Mitra et al. [2016] argue that both lexical and
semantic matching is important for
document ranking
Duet model is a linear combination of two
DNNs—focusing on lexical and semantic
matching, respectively—jointly trained on
labelled data
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Lexical sub-model operates over input matrix 𝑋
x_{i,j} = 1 if t_{q,i} = t_{d,j}, and 0 otherwise
In relevant documents,
1. Many matches, typically in clusters
2. Matches localized early in document
3. Matches for all query terms
4. In-order (phrasal) matches
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Convolve using window of size 𝑛 𝑑 × 1
Each window instance compares a query term w/
whole document
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Semantic sub-model matches in the latent
embedding space
Match query with moving windows over document
Learn text embeddings specifically for the task
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
BIG VS. SMALL DATA
REGIMES
Big data seems to be more crucial for models that focus on
good representation learning for text
Partial supervision strategies (e.g., unsupervised pre-training of
word embeddings) can be effective but may be leaving the
bigger gains on the table
Learning to train on unlabeled data
may be key to making progress on
neural ad-hoc retrieval
Which IR models are similar?
Clustering based on query level
retrieval performance.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
Duet implementation on PyTorch
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb
GET THE CODE
WIDE AND DEEP
MODEL
Deep model for representation
learning and wide model for
memorization
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
KERNEL POOLING
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017.
Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
WEB DOCUMENTS ARE MORE THAN
JUST BODY TEXT…
URL
incoming
anchor text
title
body
clicked query
EXTENDING NEURAL RANKING MODELS TO
MULTIPLE DOCUMENT FIELDS
BM25 → BM25F
Neural ranking model → ?
RANKING DOCUMENTS
WITH MULTIPLE FIELDS
Learn different embedding space for each
document field
Different fields may match different aspects of
the query—learn different query embeddings for
matching against different fields
Represent per field match by a vector, not a
score
Field level dropout during training can regularize
against over-dependency on any individual field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Learn a different embedding space
for each document field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
For multiple-instance fields,
average pool the instance level
embeddings
Mask empty text instances, and
average only among non-empty
instances to avoid preferring
documents with more instances
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
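A minimal sketch of the masked average pooling described above; the tensor shapes and example mask are assumptions for illustration.

```python
# Masked average pooling over the instances of a multiple-instance field.
import torch

def masked_average(instance_embs, mask):
    # instance_embs: [num_instances, dim]; mask: [num_instances], 1 for
    # non-empty instances. Averaging only over non-empty instances avoids
    # rewarding documents simply for having more instances of a field.
    mask = mask.unsqueeze(-1).float()           # [num_instances, 1]
    total = (instance_embs * mask).sum(dim=0)   # sum of non-empty rows
    count = mask.sum().clamp(min=1.0)           # avoid division by zero
    return total / count

embs = torch.randn(5, 64)               # e.g., 5 anchor-text instances
mask = torch.tensor([1, 1, 1, 0, 0])    # last two slots are empty/padded
field_emb = masked_average(embs, mask)  # shape: [64]
```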
Learn different query
embeddings for matching
against different fields
Different fields may match
different aspects of the query
Ideal query representation for
matching against URL likely to be
different from for matching with
title
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Represent per field match by a
vector, not a score
Allows the model to validate
that across the different fields all
aspects of the query intent have
been covered
(Similar intuition as BM25F)
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Aggregate evidence of relevance
across all document fields
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
High precision fields, such as
clicked queries, can negatively
impact the modeling of the
other fields
Field level dropout during
training can regularize against
over-dependency on any
individual field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
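One possible way to implement field-level dropout is sketched below; the field names, drop probability, and dictionary-of-embeddings representation are assumptions, not the paper's implementation.

```python
# Field-level dropout: randomly zero out entire field representations
# during training to discourage over-reliance on any single field.
import torch

def field_dropout(field_embs, p=0.2, training=True):
    # field_embs: dict mapping field name -> [dim] embedding.
    if not training:
        return field_embs
    return {name: emb * (torch.rand(()) >= p).float()
            for name, emb in field_embs.items()}

fields = {"title": torch.randn(64), "url": torch.randn(64),
          "body": torch.randn(64), "clicked_queries": torch.randn(64)}
fields = field_dropout(fields, p=0.2)
```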
MANY OTHER NEURAL ARCHITECTURES
(Palangi et al., 2015)
(Kalchbrenner et al., 2014)
(Denil et al., 2014)
(Kim, 2014)
(Severyn and Moschitti, 2015)
(Zhao et al., 2015) (Hu et al., 2014)
(Tai et al., 2015)
(Guo et al., 2016)
(Hui et al., 2017)
(Pang et al., 2017)
(Jaech et al., 2017)
(Dehghani et al., 2017)
BERT FOR RANKING
BERT and other large-scale unsupervised language models are
demonstrating dramatic performance improvements on many IR tasks
Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019.
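A minimal re-ranking sketch using the Hugging Face transformers library is shown below. The checkpoint name is a placeholder and its classification head is not fine-tuned here; Nogueira and Cho fine-tune BERT on MS MARCO relevance labels, which is not reproduced in this sketch.

```python
# Cross-encoder style scoring of a query-passage pair with BERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def score(query, passage):
    # Feed the concatenated query-passage pair to BERT and read the
    # relevance score off the classification head.
    inputs = tokenizer(query, passage, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

candidates = ["BERT is a pre-trained language model.",
              "Kolkata is a city in India."]
reranked = sorted(candidates, key=lambda p: score("what is bert", p), reverse=True)
```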
(Figure: BERT takes a query-passage pair as input and outputs a relevance score; results reported on MS MARCO)
Impact across both academia and industry
BERT FOR RANKING
WHAT DID YOUR MODEL
REALLY LEARN?
While we celebrate the recent performance bumps on
IR tasks from neural methods, it is also important to
recognize when and how they fail
Clever Hans was a horse claimed to have been
capable of performing arithmetic and other
intellectual tasks.
"If the eighth day of the month comes on a
Tuesday, what is the date of the following Friday?“
Hans would answer by tapping his hoof.
In fact, the horse was purported to have been
responding directly to involuntary cues in the
body language of the human trainer, who had the
faculties to solve each problem. The trainer was
entirely unaware that he was providing such cues.
(source: Wikipedia)
BM25 depends on the inverse document frequency of terms, whereas BERT depends on a language model of term co-occurrences.
What corpus statistics does your model depend on?
WHAT CHANGED
BETWEEN TRAIN
AND TEST?
Terms often change meaning
across domains or over time
Robust retrieval performance is
important (e.g., enterprise search
across multiple tenants)
Example: the query "uk prime minister" refers to different people in older (1990s) TREC data, in more recent data, and today.
(Figure: model trained on domains A, B, and C, then tested on an unseen domain X)
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Baseline model projects
query and document to
latent space for matching
Additional fully-connected
layers to estimate relevance
Hidden layers may encode
domain specific statistics
(Architecture: the query and the document each pass through convolution and pooling layers; the outputs are combined with a Hadamard product and fed to dense layers that produce the relevance score y)
How do we encourage the model to only learn
features that generalize across multiple domains?
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Train model on multiple domains
During training, an adversarial
discriminator inspects the hidden
states of the model and tries to
predict the source corpus of the
training sample
(Architecture: the same query and document convolution and pooling layers, Hadamard product, and dense layers producing y, with an additional dense adversarial discriminator z attached to the hidden representation)
The duet model, in addition to optimizing for the
ranking loss, also tries to “fool” the adversarial
discriminator – and in the process learns more
domain independent representations
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
(The loss is a function of the query, a relevant document, and a non-relevant document, with separate parameters for the ranking model and for the adversarial discriminator)
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Reverse the gradient from
the discriminator when
back-propagating through
the ranking model
(Architecture: as above, but the gradient from the adversarial discriminator is reversed before it reaches the ranking model)
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
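Gradient reversal can be implemented as a custom autograd function, as in the generic PyTorch sketch below (not the authors' code); the scaling factor lam is an assumption.

```python
# Gradient reversal layer: identity in the forward pass, negated and
# scaled gradient in the backward pass, so the ranking model is trained
# to increase the discriminator's loss and hide domain-specific features.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# Usage: domain_logits = discriminator(grad_reverse(hidden_state))
hidden = torch.randn(8, 32, requires_grad=True)
grad_reverse(hidden, lam=0.5).sum().backward()  # hidden.grad is -0.5 everywhere
```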
GRADIENT REVERSAL
Adversarial regularization
may also be useful for
mitigating bias
MARRYING THE OLD WITH THE NEW
(SIGIR ’94) (SIGIR ’04) (SIGIR ’05)
source: https://www.eecis.udel.edu/~hfang/AX.html
source: https://www.eecis.udel.edu/~hfang/AX.html
CONNECTION TO NEURAL RANKER TRAINING
Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
AXIOMATIC REGULARIZATION FOR NEURAL
RANKER
Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
BREAK
10 MINS
(4:20 PM - 4:30 PM)
BEYOND RERANKING
45 MINS
(4:30 PM - 5:15 PM)
RETRIEVING, NOT JUST RERANKING, WITH
DEEP NEURAL NETWORKS
Deep ranking models are compute-
intensive and are practically
employed only to rerank top-k
candidates retrieved by more
efficient traditional IR methods
IR performance may improve significantly
more if we can also use these models
for candidate generation
OPTION 1: QUERY INDEPENDENT
DOCUMENT REPRESENTATION
Employ a Siamese network architecture
Compute document representations offline
and query representation at inference time
Efficient online but large offline
computation cost
Effectiveness degrades without interaction
features and lexical term matching
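A minimal sketch of this offline/online split follows; the EmbeddingBag encoders, vocabulary size, and random term ids are stand-ins for a trained Siamese model.

```python
# Offline document encoding and online query-time scoring with a
# query-independent document representation.
import torch
import torch.nn.functional as F

doc_encoder = torch.nn.EmbeddingBag(num_embeddings=30000, embedding_dim=128)
query_encoder = torch.nn.EmbeddingBag(num_embeddings=30000, embedding_dim=128)

# Offline: encode and store every document representation once.
doc_term_ids = [torch.randint(0, 30000, (50,)) for _ in range(1000)]
with torch.no_grad():
    doc_vectors = torch.stack([doc_encoder(t.unsqueeze(0)).squeeze(0)
                               for t in doc_term_ids])
doc_vectors = F.normalize(doc_vectors, dim=-1)

# Online: encode only the query, then score against precomputed vectors.
query_ids = torch.randint(0, 30000, (1, 4))
with torch.no_grad():
    q_vec = F.normalize(query_encoder(query_ids), dim=-1).squeeze(0)
scores = doc_vectors @ q_vec          # cosine similarities
candidates = scores.topk(10).indices  # top-k candidate documents
```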
APPROXIMATE
K-NN SEARCH
Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let's replace term-based retrieval with k-nn search. In CIKM, 2016.
LEARNING SPARSE VECTOR REPRESENTATIONS
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
Hamed Zamani, et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM, 2018.
FAST APPROX. K-NN SEARCH WITH ANNOY
https://github.com/spotify/annoy
Efficient online but large offline
computation cost
Can scale to tail queries but at
higher computation cost—we can
trade-off the two experimentally
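A small usage sketch with the Annoy library (pip install annoy); random vectors stand in for learned document embeddings, and the number of trees and neighbours are illustrative.

```python
# Build an approximate nearest-neighbour index over document vectors
# and query it with a query vector.
import random
from annoy import AnnoyIndex

dim = 128
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

for doc_id in range(10000):
    index.add_item(doc_id, [random.gauss(0, 1) for _ in range(dim)])

index.build(50)         # more trees: better recall, larger index
index.save("docs.ann")  # the index can be memory-mapped by query servers

query_vec = [random.gauss(0, 1) for _ in range(dim)]
top_docs = index.get_nns_by_vector(query_vec, 10)  # approximate top-10
```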
OPTION 2: ASSUME QUERY TERM INDEPENDENCE
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
WE TYPICALLY LEARN THE PARAMETERS OF THE
MODEL BY MINIMIZING SOME PAIRWISE LOSS…
e.g.,
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
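The specific loss shown on the slide is not reproduced here; as one common example of a pairwise loss (an illustration, not necessarily the exact objective used in the paper), a hinge loss over a relevant document $d^{+}$ and a non-relevant document $d^{-}$ with margin $\epsilon$ is:

$\mathcal{L}(q, d^{+}, d^{-}) = \max\left(0,\; \epsilon - s(q, d^{+}) + s(q, d^{-})\right)$

where $s(q, d)$ is the score the model assigns to a query-document pair.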
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
TERM-DOCUMENT
IMPACT SCORES
(Matrix of term-document impact scores, with terms t1…t5 as rows and documents d1…d5 as columns)
If the IR model assumes query term independence,
we can precompute all term-document impact scores
The matrix is generally sparse, either by definition or
by enforcing additional sparsity constraints
(e.g., assume only terms that appear in the document
have non-zero impact)
Precomputed scores can be used with inverted index
for fast retrieval from large collections
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
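With query term independence, retrieval-time scoring reduces to summing precomputed per-term impact scores, exactly the access pattern of an inverted index. A minimal sketch follows; the impact scores and document ids are made up for illustration.

```python
# Retrieval by summing precomputed term-document impact scores.
from collections import defaultdict

impact = {
    "prime":    {"d1": 2.1, "d3": 0.7},
    "minister": {"d1": 1.4, "d2": 0.9},
}

def retrieve(query_terms, k=10):
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, s in impact.get(term, {}).items():
            scores[doc_id] += s  # sum of precomputed term-document impacts
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(retrieve(["prime", "minister"]))  # [('d1', 3.5), ('d2', 0.9), ('d3', 0.7)]
```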
NEURAL RANKING MODEL WITH QUERY TERM
INDEPENDENCE ASSUMPTION
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
THE EFFECTIVENESS-EFFICIENCY TRADE-OFF
The model does not have
the context of the full query
which may result in reduced
effectiveness
However, now we can
precompute everything and
use the learned model in a
full retrieval setting!
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
HOW SERIOUS IS THE
LOSS IN EFFECTIVENESS
FROM ASSUMING QUERY
TERM INDEPENDENCE?
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
Reranking evaluation
Full retrieval evaluation
(on a smaller set of queries than previous table)
DOCUMENT
EXPANSION
Similar to query term
independence approach in that
they are both trying to learn a
better document language model
The training objective here,
however, is to predict relevant
queries and not the target ranking
metric we care about
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document Expansion by Query Prediction. In arXiv, 2019.
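A sketch of how such expansion is applied at indexing time; generate_queries is a hypothetical placeholder for the trained query-prediction (seq2seq) model, and its canned output here is for illustration only.

```python
# Document expansion at indexing time.
def generate_queries(passage, num_queries=5):
    # Placeholder: a real implementation would sample predicted queries
    # from a model trained to map passages to queries (e.g., on MS MARCO).
    return ["what is document expansion"][:num_queries]

def expand_document(passage):
    # Append the predicted queries to the passage before indexing, so that
    # a standard inverted index (e.g., BM25) can match the added terms.
    return passage + " " + " ".join(generate_queries(passage))

expanded = expand_document("Document expansion adds likely query terms to a passage.")
```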
TRADING-OFF SEARCH RESULT QUALITY AND QUERY
RESPONSE TIME IN LARGE SCALE IR SYSTEMS
In Bing, we have a candidate generation stage followed by multiple rank and
prune stages
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
In Bing, the index is distributed over multiple machines
For candidate generation, on each machine the documents are linearly scanned using a match plan
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
When a query comes in, it is automatically
categorized and a pre-defined match plan is
selected
A match plan consists of a sequence of
match rules, and corresponding stopping
criteria
A match rule defines the condition that
a document should satisfy to be selected as
a candidate
The stopping criterion decides when the
index scan using a particular match rule
should terminate—and whether the matching
process should continue with the next match
rule, conclude, or reset to the beginning
of the index
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
Match plans influence the
trade-off between
effectiveness and efficiency
E.g., long queries with rare
intents may require expensive
match plans that consider
body text and search deeper
into the index
In contrast, for popular
navigational queries a shallow
scan against URL and title
metastreams may be sufficient
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
E.g.,
Query: halloween costumes
Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T )
Query: facebook login
Match rule: mrB → (facebook ∈ U|T )
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
During execution, two accumulators are tracked
u: the number of blocks accessed from disk
v: the cumulative number of term matches in all inspected documents
A stopping criterion sets a threshold for each – when either threshold is met, the scan using
that particular match rule terminates
Matching may then continue with a new match rule, terminate, or re-start from the beginning
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
TYPICALLY THESE MATCH PLANS ARE HAND-
CRAFTED AND STATICALLY ASSIGNED TO DIFFERENT
QUERY CATEGORIES
WE CAN CAST THE MATCH PLANNING AS A
REINFORCEMENT LEARNING TASK!
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
REINFORCEMENT
LEARNING
(The standard reinforcement learning loop: the agent takes an action, and the environment returns a reward and a new state)
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
(Here the environment is the index, the action is the choice of match rule, the state is the accumulators (u, v), and the reward is relevance discounted by the number of index blocks accessed)
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Learn a policy πθ : S → A that maximizes the
cumulative discounted reward R = \sum_{t} \gamma^{t} r_{t},
where γ is the discount rate and r_t is the reward at step t
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
We use table-based Q-learning
State space: discrete <ut, vt>
Action space: the available match rules
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Reward function:
g(di) is the relevance of the ith
document estimated based on the
subsequent L1 ranker score—
considering only top n documents
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Final reward:
If no new documents are selected,
we assign a small negative reward
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
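A generic tabular Q-learning sketch is shown below to make the update rule concrete; the action set, reward value, state discretization, and hyperparameters are illustrative stand-ins, not the production match-planning system.

```python
# Tabular Q-learning over discretized accumulator states (u, v), with
# actions indexing match rules.
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]                      # e.g., three candidate match rules
Q = defaultdict(lambda: [0.0] * len(ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration

def choose_action(state):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))                   # explore
    return max(range(len(ACTIONS)), key=lambda a: Q[state][a])  # exploit

def q_update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

# One illustrative transition: scanning with match rule 1 moves the
# accumulators from (u=0, v=0) to (u=3, v=12) and yields reward 0.4.
q_update((0, 0), 1, 0.4, (3, 12), done=False)
```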
RESULTS
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
DEEP LEARNING @ TREC
15 MINS
(5:15 PM - 5:30 PM)
GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA
Past (proprietary data): 200K queries, human-labeled, proprietary
Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017
Past (weak supervision): 1+M queries, weak supervision, open
Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017
Here (two new datasets): 300+K queries, human-labeled, open
TREC 2019 Deep Learning Track
(Figure axes: more data, better search results)
GENERATING PUBLIC BENCHMARKS FOR NEURAL IR
RESEARCH
A public retrieval and ranking benchmark
with large scale training data (~400K
queries with manual relevance labels)
DERIVING OUR TREC 2019 DATASETS
MS MARCO QnA
Leaderboard
• 1M real queries
• 10 passages per Q
• Human annotation
says ~1 of 10
answers the query
MS MARCO Passage
Retrieval Leaderboard
• Corpus: Union of
10-passage sets
• Labels: From the
~1 positive passage
TREC 2019 Task:
Passage Retrieval
• Same corpus,
training Q+labels
• New reusable NIST
test set
TREC 2019 Task:
Document Retrieval
• Corpus:
Documents (crawl
passage urls)
• Labels: Transfer
from passage to
doc
• New reusable NIST
test set
http://msmarco.org
https://microsoft.github.io/TREC-2019-Deep-Learning/
SETUP OF THE 2019 DEEP LEARNING TRACK
• Key question: What works best in a large-data regime?
• “nnlm”: Runs that use a BERT-style language model
• “nn”: Runs that do representation learning
• “trad”: Runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
• “fullrank”: End-to-end retrieval
• “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
Task                  | Training data               | Test data                  | Corpus
1) Document retrieval | 367K queries w/ doc labels  | 43* queries w/ doc labels  | 3.2M documents
2) Passage retrieval  | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
DATASET AVAILABILITY
• Corpus + train + dev data for both tasks
available now from the DL Track site*
• NIST test sets available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
Download the slides:
http://bit.ly/dl4search-fire2019
Download the free book:
http://bit.ly/neuralir-intro
Download TREC Deep Learning Track data:
https://microsoft.github.io/TREC-2019-Deep-Learning/
@UnderdogGeek bmitra@microsoft.com
THANK YOU
Deep Learning for Search

  • 1. Deep Learning for Search (Alternative title: “Neural Information Retrieval”) Bhaskar Mitra Principal Applied Scientist, Microsoft PhD Student, University College London @UnderdogGeek bmitra@microsoft.com Background image modified from source: https://commons.wikimedia.org/wiki/File:Howrah_Bridge_from_the_western_bank_of_the_Ganges.jpg
  • 2. I am an applied researcher at Bing. Based in Microsoft Research Montreal. Previously worked for Microsoft in Hyderabad, Seattle, and Cambridge. Part-time PhD candidate at University College London. My research is on neural methods for information retrieval. Originally born and grew up in Kolkata. Bing UCL MSR Cambridge MSR Montreal MSR Montreal MSR Cambridge UCL works for used to be based in based in doing PhD at
  • 3. Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks.
  • 4. THE STATE OF NEURAL INFORMATION RETRIEVAL GROWING PUBLICATION POPULARITY AT TOP IR CONFERENCES STRONG PERFORMANCE AGAINST TRADITIONAL METHODS IN TREC 2019
  • 5. Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com RESOURCES
  • 6. AGENDA Let’s focus on the fundamentals! Please feel free to interrupt and ask lots of questions!
  • 7. THE SEARCH TASK 10 MINS (10:05 AM - 10:15 AM)
  • 8. INFORMATION RETRIEVAL (IR) User has an information need There exists a collection of information resources IR is the activity of retrieving the information resources relevant to the information need
  • 9. EXAMPLE OF AN IR TASK (WEB SEARCH) User expresses information need as a short textual query The search engine retrieves top relevant web documents as information resources We will use web search as the main example of an IR task in the rest of this lecture query Information need retrieval system indexes a document corpus results ranking (document list) Relevance (documents satisfy information need)
  • 10. DESIDERATA Decades of IR research has identified some key factors that text retrieval models should consider Traditional IR models typically incorporate one of more of these Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness
  • 11. DESIDERATA A document that contains more occurrences of the query term(s) is more likely to be relevant Tip: consider term frequency (TF) Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 12. DESIDERATA A rare term (e.g., “msmarco”) is likely to be more informative than a common term (e.g., “and”) Tip: consider inverse document frequency (IDF) Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness more informative than
  • 13. DESIDERATA A term should not contribute disproportionately Increase in TF should have larger impact for smaller TFs Tip: put a saturation function over the TF Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 14. DESIDERATA A document containing more non-relevant terms is likely to be less relevant Tip: perform document length normalization Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 15. DESIDERATA A document containing query terms in close proximity is likely to be more relevant than one where the terms occur far away from each other Tip: consider proximity features Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 16. DESIDERATA Term matches earlier in the document may indicate more likelihood of the document being relevant Tip: consider position of term matches Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 17. DESIDERATA Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness uk prime minister The query and the document may refer to the same concept using different vocabularies Tip: consider expanding the query or document, or matching query and document terms in a latent space theresa may
  • 18. DESIDERATA Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness albuquerque By inspecting other terms in the document we may infer if the document is about the query term Tip: consider expanding the query or matching the query terms with the document terms in a latent space Passage about Albuquerque Passage not about Albuquerque
  • 19. EXAMPLES OF RANKING METRICS Discounted Cumulative Gain (DCG) 𝐷𝐶𝐺@𝑘 = 𝑖=1 𝑘 2 𝑟𝑒𝑙𝑖 − 1 𝑙𝑜𝑔2 𝑖 + 1 Reciprocal Rank (RR) 𝑅𝑅@𝑘 = max 1<𝑖<𝑘 𝑟𝑒𝑙𝑖 𝑖
  • 20. FUNDAMENTALS OF NEURAL NETWORKS 30 MINS (10:15 AM - 10:45 AM)
  • 21. NEURAL NETWORKS Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ) Popular choices for σ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backwardpass Expected output loss Tanh ReLU
  • 23. SQUARED LOSS The squared loss is a popular loss function for regression tasks
  • 24. THE SOFTMAX FUNCTION In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 25. CROSS ENTROPY The cross entropy between two probability distributions 𝑝 and 𝑞 over a discrete set of events is given by, If 𝑝 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 = 1and 𝑝𝑖 = 0 for all other values of 𝑖 then,
  • 26. CROSS ENTROPY WITH SOFTMAX LOSS Cross entropy with softmax is a popular loss function for classification
  • 27. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕𝑙 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 STOCHASTIC GRADIENT DESCENT (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 28. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕 𝑦 − 𝑦2 2 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 29. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 30. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 31. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 32. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 𝜕𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 33. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤1. 𝑥 + 𝑏1 × 𝑥 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 34. COMPUTATION NETWORKS The “Lego” approach to specifying DNN architectures Library of computation nodes, each node defines logic for: 1. Forward pass: compute output given input 2. Backward pass: compute gradient of loss w.r.t. inputs, given gradient of loss w.r.t. outputs 3. Parameter gradient: compute gradient of loss w.r.t. parameters, given gradient of loss w.r.t. outputs Chain nodes to create bigger and more complex networks
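As an illustration of the "Lego" idea (not code from the deck), a computation node can be a small class exposing the three pieces of logic; the class and method names here are hypothetical.

```python
import numpy as np

class Tanh:
    """Elementwise tanh node: no parameters, so only forward and backward."""
    def forward(self, x):
        self.out = np.tanh(x)
        return self.out
    def backward(self, grad_out):
        # gradient of loss w.r.t. input, given gradient of loss w.r.t. output
        return grad_out * (1.0 - self.out ** 2)

class Linear:
    """Affine node y = xW + b, with parameter gradients."""
    def __init__(self, d_in, d_out):
        self.W = np.random.randn(d_in, d_out) * 0.1
        self.b = np.zeros(d_out)
    def forward(self, x):
        self.x = x
        return x @ self.W + self.b
    def backward(self, grad_out):
        # parameter gradients, given gradient of loss w.r.t. outputs
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # gradient of loss w.r.t. inputs, to pass to the previous node
        return grad_out @ self.W.T

# chaining nodes creates a bigger, more complex network
net = [Linear(4, 8), Tanh(), Linear(8, 1)]
```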
  • 35. TOOLKITS A diverse set of options to choose from! Figure from https://towardsdatascience.com/battle-of- the-deep-learning-frameworks-part-i-cff0e3841750
  • 36. TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH First, we define the model architecture Next, we specify loss function and optimization algorithm Finally, loop over training data to optimize model parameters https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
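The referenced tutorial follows the three-step pattern above; the sketch below is a condensed PyTorch version (model size, data loading, and hyperparameters are simplified placeholders, not the tutorial's exact code).

```python
import torch
import torch.nn as nn
import torch.optim as optim

# 1. define the model architecture (tiny convnet for 32x32 RGB images)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),   # 10 classes, e.g. CIFAR-10
)

# 2. specify the loss function and the optimization algorithm
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 3. loop over the training data to optimize the model parameters
def train(loader, epochs=2):
    for _ in range(epochs):
        for images, labels in loader:     # loader yields mini-batches
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```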
  • 37. REALLY DEEP NEURAL NETWORKS (Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
  • 38. WHY ADDING DEPTH HELPS http://playground.tensorflow.org
  • 39. VISUAL MOTIVATION FOR HIDDEN UNITS Consider the following “toy” challenge for classifying tech queries: Vocab: {surface, kerberos, book, library}; Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗. Input features → label: (1, 0, 1, 0) ✓; (1, 1, 0, 0) ✗; (0, 1, 0, 1) ✓; (0, 0, 1, 1) ✗ — can’t separate using a linear model! But let’s consider a tiny neural network with one hidden layer (units H1 and H2, with hand-set edge weights) [diagram]…
  • 40. VISUAL MOTIVATION FOR HIDDEN UNITS Or more succinctly, with the same tiny one-hidden-layer network: input features (surface, kerberos, book, library) → hidden layer (H1, H2) → label: (1, 0, 1, 0) → (1, 0) ✓; (1, 1, 0, 0) → (0, 0) ✗; (0, 1, 0, 1) → (0, 1) ✓; (0, 0, 1, 1) → (0, 0) ✗ — can separate using a linear model!
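The hidden units can be verified directly. The numpy sketch below uses one illustrative choice of weights (not necessarily the exact weights on the slide) that reproduces the H1/H2 columns of the table and makes the labels linearly separable.

```python
import numpy as np

# bag-of-words features over the vocab {surface, kerberos, book, library}
queries = {
    "surface book":     np.array([1, 0, 1, 0]),   # label: positive
    "kerberos library": np.array([0, 1, 0, 1]),   # label: positive
    "kerberos surface": np.array([1, 1, 0, 0]),   # label: negative
    "library book":     np.array([0, 0, 1, 1]),   # label: negative
}

# illustrative hidden-unit weights: H1 fires only for surface AND book,
# H2 fires only for kerberos AND library
W = np.array([[ 0.5, -1.0],    # surface
              [-1.0,  0.5],    # kerberos
              [ 0.5, -1.0],    # book
              [-1.0,  0.5]])   # library
b = np.array([-0.5, -0.5])

for q, x in queries.items():
    h = (x @ W + b > 0).astype(int)      # hidden layer (H1, H2)
    label = int(h.sum() > 0)             # a linear threshold now separates the classes
    print(f"{q:18s} H1,H2={h} -> {'positive' if label else 'negative'}")
```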
  • 41. WHY ADDING DEPTH HELPS Deeper networks can split the input space into many more (non-independent) linear regions than shallow networks. Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
  • 42. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019. THE LOTTERY TICKET HYPOTHESIS
  • 43. BIAS-VARIANCE TRADE-OFF IN THE DEEP LEARNING ERA Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
  • 44. LEARNING TO RANK 35 MINS (10:45 AM - 11:20 AM)
  • 45. MOST IR SYSTEMS PRESENT RANKED LISTS OF RETRIEVED INFORMATION ARTIFACTS
  • 46. THE UNREASONABLE EFFECTIVENESS OF SIMPLE LTR BASED APPROACHES
  • 47. LEARNING TO RANK (LTR) ”... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 48. LEARNING TO RANK (LTR) L2R models represent a rankable item—e.g., a document—given some context—e.g., a user-issued query—as a numerical vector 𝑥 ∈ ℝ 𝑛 The ranking model 𝑓: 𝑥 → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher. Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 49. WHY IS RANKING CHALLENGING? Rank based metrics, such as DCG or MRR, are non-smooth / non-differentiable
  • 50. APPROACHES Pointwise approach Relevance label 𝑦 𝑞,𝑑 is a number—derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict 𝑦 𝑞,𝑑 given 𝑥 𝑞,𝑑. Pairwise approach Pairwise preference between documents for a query (𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞) as label. Reduces to binary classification to predict more relevant document. Listwise approach Directly optimize for rank-based metric, such as NDCG—difficult because these metrics are often not differentiable w.r.t. model parameters. Liu [2009] categorizes different LTR approaches based on training objectives: Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
  • 51. FEATURES They can often be categorized as: Query-independent or static features e.g., incoming link count and document length Query-dependent or dynamic features e.g., BM25 Query-level features e.g., query length Traditional L2R models employ hand-crafted features that encode IR insights
  • 52. FEATURES Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 53. POINTWISE OBJECTIVES Regression loss: given (q, d), predict the value of y_q,d — e.g., square loss for binary or categorical labels, where y_q,d is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 54. POINTWISE OBJECTIVES Classification loss: given (q, d), predict the class y_q,d — e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008], where s(y_q,d) is the model’s score for label y_q,d. Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
  • 55. PAIRWISE OBJECTIVES Given (q, d_i, d_j), predict the more relevant document. For (q, d_i) and (q, d_j): feature vectors x_i and x_j, model scores s_i = f(x_i) and s_j = f(x_j). Pairwise loss minimizes the average number of inversions in the ranking—i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Pairwise loss generally has the form φ(s_i − s_j) [Chen et al., 2009], where φ can be: • Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000] • Exponential function φ(z) = e^(−z) [Freund et al., 2003] • Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005] • Others… Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
  • 56. PAIRWISE OBJECTIVES RankNet loss: pairwise loss function proposed by Burges et al. [2005]—an industry favourite [Burges, 2015]. Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ·(s_i − s_j))). Desired probabilities: p̄_ij = 1 and p̄_ji = 0. Computing the cross-entropy between p̄ and p: L_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ·(s_i − s_j))). Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
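A minimal PyTorch sketch of the RankNet loss (not the original implementation): the scorer f can be any model producing a real-valued score, and the loss is computed stably via softplus, since log(1 + e^(−γ(s_i − s_j))) = softplus(−γ(s_i − s_j)).

```python
import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    """RankNet loss for pairs where document i is preferred over document j:
    log(1 + exp(-gamma * (s_i - s_j)))."""
    return F.softplus(-gamma * (s_i - s_j)).mean()

# usage: s_i, s_j are model scores f(x_i), f(x_j) for preferred / non-preferred docs
s_i = torch.tensor([2.1, 0.3], requires_grad=True)
s_j = torch.tensor([1.0, 0.9], requires_grad=True)
loss = ranknet_loss(s_i, s_j)
loss.backward()   # gradients flow back into the model that produced the scores
```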
  • 57. A GENERALIZED CROSS-ENTROPY LOSS An alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D. Predicted probability: p(d⁺|q) = e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d)). The cross-entropy loss is then given by L_CE(q, d⁺, D) = −log p(d⁺|q) = −log( e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d)) ). Computing the softmax over the full collection is prohibitively expensive—LTR models typically consider a few sampled negative candidates instead [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
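A sketch of the sampled variant of this loss (a standard approximation, not code from the cited papers): the softmax is computed over one relevant document and a handful of sampled negatives rather than the full collection.

```python
import torch
import torch.nn.functional as F

def sampled_ce_loss(pos_score, neg_scores, gamma=1.0):
    """pos_score: [batch], neg_scores: [batch, num_neg].
    Cross-entropy of the relevant document against sampled negatives."""
    logits = gamma * torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
    target = torch.zeros(logits.size(0), dtype=torch.long)  # index 0 = relevant doc
    return F.cross_entropy(logits, target)

# e.g., a batch of 2 queries, each with 4 sampled negative documents
loss = sampled_ce_loss(torch.randn(2), torch.randn(2, 4))
```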
  • 58. Blue: relevant Gray: non-relevant NDCG and ERR higher for left but pairwise errors less for right Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks But listwise metrics are non-continuous and non-differentiable LISTWISE OBJECTIVES Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010. [Burges, 2010]
  • 59. LISTWISE OBJECTIVES Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t model scores) 2. It is desired that the gradient be bigger for pairs of documents that produces a bigger impact in NDCG by swapping positions Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006. LambdaRank loss Multiply actual gradients with the change in NDCG by swapping the rank positions of the two documents
  • 60. LISTWISE OBJECTIVES According to the Luce model [Luce, 2005], given four items 𝑑1, 𝑑2, 𝑑3, 𝑑4 the probability of observing a particular rank-order, say 𝑑2, 𝑑1, 𝑑4, 𝑑3 , is given by: where, 𝜋 is a particular permutation and 𝜙 is a transformation (e.g., linear, exponential, or sigmoid) over the score 𝑠𝑖 corresponding to item 𝑑𝑖 R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008. ListNet loss Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model score and ground- truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly, computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible.
  • 61. LISTWISE OBJECTIVES Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009. Smooth DCG Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss
  • 62. BREAK 10 MINS (11:20 AM - 11:30 AM)
  • 64. TYPES OF VECTOR REPRESENTATIONS Local (or one-hot) representation Every term in vocabulary T is represented by a binary vector of length |T|, where one position in the vector is set to one and the rest to zero Distributed representation Every term in vocabulary T is represented by a real-valued vector of length k. The vector can be sparse or dense. The vector dimensions may be observed (e.g., hand-crafted features) or latent (e.g., embedding dimensions).
  • 65. Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
  • 66. OBSERVED (OR EXPLICIT) DISTRIBUTED REPRESENTATIONS The choice of features is a key consideration The distributional hypothesis states that terms that are used (or occur) in similar context tend to be semantically similar [Harris, 1954] Firth [1957] famously purported this idea of distributional semantics by stating “a word is characterized by the company it keeps”. Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954. Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford. Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
  • 67. MINOR NOTE: SPOT THE DIFFERENCE! DISTRIBUTED REPRESENTATION Vector representations of items as combinations of different features or dimensions (as opposed to one-hot) DISTRIBUTIONAL SEMANTICS Linguistic items with similar distributions (e.g. context words) have similar meanings http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
  • 68. EXAMPLE: TERM-CONTEXT VECTOR SPACE T: vocabulary, C: set of contexts, S: sparse |T| × |C| matrix whose entry S_ij scores term t_i against context c_j (e.g., PPMI: Positive Pointwise Mutual Information). Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
  • 69. EXAMPLE: SALTON’S VECTOR SPACE D: collection, T: vocabulary, S: sparse |D| × |T| matrix whose entry S_ij is the (IDF-weighted) weight of term t_j in document d_i. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
  • 70. NOTIONS OF SIMILARITY Two terms are similar if their feature vectors are close But different feature spaces may capture different notions of similarity Is Seattle more similar to… Sydney (similar type) or Seahawks (similar topic) Depends on your choice of features
  • 71. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
  • 72. NOTIONS OF SIMILARITY Topical or Syntagmatic similarity
  • 73. NOTIONS OF SIMILARITY Typical or Paradigmatic similarity
  • 74. NOTIONS OF SIMILARITY A mix of Topical and Typical similarity
  • 75. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
  • 76. RETRIEVAL USING VECTOR REPRESENTATIONS Map both query and candidate documents into the same vector space and retrieve the documents closest to the query—e.g., using Salton’s vector space model: sim(q, d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖), where v_q and v_d are vectors of TF-IDF scores over all terms in the vocabulary. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
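A minimal numpy sketch of retrieval in this vector space (toy tokenization, toy IDF, and toy documents; not production code).

```python
import numpy as np
from collections import Counter

docs = ["albuquerque is a city in new mexico",
        "the population of the city grew",
        "giraffes are even toed ungulate mammals"]
vocab = sorted(set(" ".join(docs).split()))
# inverse document frequency: log(N / document frequency)
idf = np.log(len(docs) / np.array(
    [sum(t in d.split() for d in docs) for t in vocab]))

def tfidf(text):
    tf = Counter(text.split())
    return np.array([tf[t] for t in vocab]) * idf

def sim(q, d):
    vq, vd = tfidf(q), tfidf(d)
    denom = np.linalg.norm(vq) * np.linalg.norm(vd)
    return vq @ vd / denom if denom > 0 else 0.0

query = "city population"
ranked = sorted(docs, key=lambda d: sim(query, d), reverse=True)
```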
  • 77. REGULARITIES IN OBSERVED FEATURE SPACES Some feature spaces capture interesting linguistic regularities e.g., simple vector algebra in the term-neighboring term space may be useful for word analogy tasks Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
  • 78. EMBEDDINGS An embedding is a representation of items in a new space such that the properties of, and the relationships between, the items are preserved from the original representation. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
  • 79. EMBEDDINGS e.g., 200-dimensional term embedding for “banana”
  • 80. EMBEDDINGS Compared to observed feature spaces: • Embeddings typically have fewer dimensions • The space may have more disentangled principle components • The dimensions may be less interpretable • The latent representations may generalize better
  • 81. What’s the advantage of latent vector spaces over observed features spaces?
  • 82. LET’S TAKE AN IR EXAMPLE In Salton’s vector space, both these passages are equidistant from the query “Albuquerque” A latent feature representation may put the first passage closer to the query because of terms like “population” and “area” Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
  • 83. HOW TO LEARN TERM EMBEDDINGS? Multiple approaches have been proposed for learning embeddings from <term, context, count> data—i.e., from a |T| × |C| matrix X whose entry X_ij counts term t_i in context c_j. Popular approaches include matrix factorization and stochastic gradient descent (SGD).
  • 84. LATENT SEMANTIC ANALYSIS (LSA) Perform SVD on X to obtain its low-rank approximation Involves finding a solution to X = 𝑈Σ𝑉T The embedding for the ith term is given by Σk 𝑡𝑖 Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990.
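A numpy sketch of LSA on a term-document count matrix X (the matrix and the rank k below are random placeholders): the truncated SVD yields low-dimensional term and document representations and a rank-k approximation of X.

```python
import numpy as np

# X: |T| x |D| term-document count (or TF-IDF) matrix -- toy random example
X = np.random.poisson(0.3, size=(1000, 200)).astype(float)
k = 50                                   # target number of latent dimensions

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_embeddings = U_k * s_k              # row i: embedding of the i-th term
doc_embeddings = Vt_k.T * s_k            # row j: embedding of the j-th document
X_approx = U_k @ np.diag(s_k) @ Vt_k     # rank-k approximation of X
```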
  • 85. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990. LATENT SEMANTIC ANALYSIS (LSA)
  • 86. WORD2VEC Goal: a simple (shallow) neural model that learns from a billion-word-scale corpus by predicting words from their neighbours (or vice versa) within a fixed-size context window. Two different architectures: 1. Skip-gram 2. CBOW. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • 88. THE SKIP-GRAM LOSS S is the set of all windows over the training text c is the number of neighbours we need to predict on either side of the term 𝑡𝑖 Full softmax is computationally impractical - hierarchical softmax or negative sampling is employed instead
  • 89. CONTINUOUS BAG-OF-WORDS (CBOW) Predict the middle term 𝑡𝑖 given {𝑡𝑖−𝑐, … , 𝑡𝑖−1, 𝑡𝑖+1, … , 𝑡𝑖+𝑐}
  • 90. THE CBOW LOSS Note: from every window of text skip-gram generates 2 x c training samples whereas CBOW generates one – that’s why CBOW trains faster than skip-gram
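The 2c-versus-1 difference is easy to see by enumerating the training samples generated from a single window; the small helper below is purely illustrative (not from the slides).

```python
def window_samples(tokens, i, c=2):
    """Training samples generated from the window centred on tokens[i]."""
    context = [tokens[j] for j in range(max(0, i - c), min(len(tokens), i + c + 1))
               if j != i]
    skipgram = [(tokens[i], ctx) for ctx in context]   # up to 2c (input, target) pairs
    cbow = [(context, tokens[i])]                      # a single (context, target) sample
    return skipgram, cbow

sg, cb = window_samples("the quick brown fox jumps".split(), i=2, c=2)
# sg: [('brown','the'), ('brown','quick'), ('brown','fox'), ('brown','jumps')]
# cb: [(['the','quick','fox','jumps'], 'brown')]
```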
  • 91. WORD ANALOGIES WITH WORD2VEC W2v is popular for word analogy tasks But remember the same relationships also exist in the observed feature space, as we saw earlier
  • 92. A MATRIX INTERPRETATION OF WORD2VEC Let x_ij be the frequency of the pair (t_i, t_j) in the training data, forming a |T| × |T| term-term co-occurrence matrix X. The word2vec objective can then be read as a cross-entropy error between the actual and the predicted co-occurrence probabilities.
  • 93. GLOVE Replace the cross-entropy error with a squared error and apply a saturation function f(·) over x_ij: ℒ_GloVe = Σ_{i=1}^{|T|} Σ_{j=1}^{|T|} f(x_ij) (log x_ij − w_i⊺ w_j)², i.e., a saturation-weighted squared error between the actual (log x_ij) and the predicted (w_i⊺ w_j) co-occurrence statistics. Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • 94. PARAGRAPH2VEC W2v style model where context is document, not neighboring term Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
  • 95. RECAP: HOW TO LEARN TERM EMBEDDINGS? Learn from <term, context, count> data Choice of context (e.g., neighboring term or container document) defines what relationship you are modeling Choice of learning algorithm (e.g., matrix factorization or SGD) defines how well you model the relationship Choice of context and learning algorithm are independent – you can use matrix factorization with neighboring term context, or a w2v-style neural network with document context (e.g., paragraph2vec)
  • 96. EXAMPLES OF TEXT EMBEDDINGS (model | embedding for | source item | target item | learning model)
Latent Semantic Analysis, Deerwester et al. (1990) | Single word | Word (one-hot) | Document (one-hot) | Matrix factorization
Word2vec, Mikolov et al. (2013) | Single word | Word (one-hot) | Neighboring word (one-hot) | Neural network (shallow)
GloVe, Pennington et al. (2014) | Single word | Word (one-hot) | Neighboring word (one-hot) | Matrix factorization
Semantic Hashing (auto-encoder), Salakhutdinov and Hinton (2007) | Multi-word text | Document (bag-of-words) | Same as source (bag-of-words) | Neural network (deep)
DSSM, Huang et al. (2013), Shen et al. (2014) | Multi-word text | Query text (bag-of-trigrams) | Document title (bag-of-trigrams) | Neural network (deep)
Session DSSM, Mitra (2015) | Multi-word text | Query text (bag-of-trigrams) | Next query in session (bag-of-trigrams) | Neural network (deep)
Language Model DSSM, Mitra and Craswell (2015) | Multi-word text | Query prefix (bag-of-trigrams) | Query suffix (bag-of-trigrams) | Neural network (deep)
  • 98. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
  • 102. LET’S TALK (BRIEFLY) ABOUT SUPERVISION FOR LEARNING TEXT EMBEDDINGS WITH DNNS Supervised approach Ideal if sufficiently labeled training data is available for the target retrieval task Unsupervised approach E.g., training an auto-encoder or a language model on unlabeled corpus Hybrid approach Current state-of-the-art results have employed large-scale unsupervised pretraining—followed by sufficiently large-scale supervised fine-tuning towards the target task
  • 103. SIAMESE NETWORK Supervised model trained on triples (q, d1, d2) where d1 is relevant to q but d2 is non-relevant. Logistic loss is popularly used—think RankNet, where sim(v_q, v_d) is the model score. Typically the left and right towers have similar architectures, and may also share the learnable parameters.
  • 104. AUTOENCODER Unsupervised models trained to minimize reconstruction errors Information Bottleneck method (Tishby et al., 1999) The bottleneck layer 𝑥 captures “minimal sufficient statistics” of 𝑣 and is a compressed representation of the same
  • 105. LANGUAGE MODELING A family of language modeling tasks have been explored in the literature, including: • Predict next word in a sequence • Predict masked word in a sequence • Predict next sentence Fundamentally the same idea as word2vec and older neural LMs—but with deeper models and considering dependencies across longer distances between terms w1 [MASK]w2 w4 model ? loss w3
  • 106. SHIFT-INVARIANT NEURAL OPERATIONS Detecting a pattern in one part of the input space is similar to detecting it in another Leverage redundancy by moving a window over the whole input space and then aggregate On each instance of the window a kernel—also known as a filter or a cell—is applied Different aggregation strategies lead to different architectures
  • 107. CONVOLUTION Move the window over the input space, each time applying the same cell over the window. A typical cell operation is h = σ(W·X + b). Full input: [words × in_channels]; cell input: [window × in_channels]; cell output: [1 × out_channels]; full output: [1 + (words − window) / stride × out_channels].
  • 108. POOLING Move the window over the input space, each time applying an aggregate function over each dimension within the window: h_j = max_{i∈win} X_{i,j} (max-pooling) or h_j = avg_{i∈win} X_{i,j} (average-pooling). Full input: [words × channels]; cell input: [window × channels]; cell output: [1 × channels]; full output: [1 + (words − window) / stride × channels].
  • 109. CONVOLUTION W/ GLOBAL POOLING Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed length embedding for a variable length text Full Input [words x in_channels] Full Output [1 x out_channels]
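A PyTorch sketch of this pattern: convolution over a sequence of term embeddings followed by global max-pooling yields a fixed-length embedding for a variable-length text (the dimensions below are arbitrary).

```python
import torch
import torch.nn as nn

embed_dim, out_channels, window = 128, 256, 3

conv = nn.Conv1d(in_channels=embed_dim, out_channels=out_channels,
                 kernel_size=window)

# x: [batch, words, embed_dim] sequence of term embeddings
x = torch.randn(8, 20, embed_dim)
h = conv(x.transpose(1, 2))           # [batch, out_channels, 1 + (words - window)/stride]
h = torch.relu(h)
text_embedding = h.max(dim=2).values  # global max-pooling -> [batch, out_channels]
```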
  • 110. RECURRENCE Similar to a convolution layer but with an additional dependency on the previous hidden state. A simple cell operation is shown below, but cells like LSTMs and GRUs are more popular in practice: h_i = σ(W·X_i + U·h_{i−1} + b). Full input: [words × in_channels]; cell input: [window × in_channels] + [1 × out_channels]; cell output: [1 × out_channels]; full output: [1 × out_channels].
  • 111. RECURSIVE OR TREE- RNN Shared weights among all the levels of the tree Cell can be an LSTM or as simple as ℎ = 𝜎 𝑊𝑋 + 𝑏 Full Input [words x channels] Cell Input [window x channels] Cell Output [1 x channels] Full Output [1 x channels]
  • 112. ATTENTION Given a set of n items and an input context, produce a probability distribution {a1, …, ai, …, an} of attending to each item as a function of similarity between a learned representation (q) of the context and learned representations (k_i) of the items: a_i = φ(q, k_i) / Σ_{j=1}^{n} φ(q, k_j). The aggregated output is given by Σ_{i=1}^{n} a_i · v_i. Full input: [words × in_channels], [1 × ctx_channels]; full output: [1 × out_channels]. * When attending over a sequence (and not a set), the key k and value v are typically a function of the item and some encoding of its position.
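A small sketch of this computation, with φ instantiated as a scaled dot product (one common choice; the slide leaves φ unspecified).

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    """q: [ctx_dim], K: [n, ctx_dim], V: [n, out_dim].
    Returns the attention weights a and the aggregated output sum_i a_i * v_i."""
    scores = K @ q / K.size(-1) ** 0.5   # phi(q, k_i) as a scaled dot product
    a = F.softmax(scores, dim=0)         # probability distribution over the n items
    return a, a @ V

a, output = attend(torch.randn(64), torch.randn(10, 64), torch.randn(10, 32))
```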
  • 113. SELF ATTENTION Given a sequence (or set) of n items, treat each item as the context at a time and attend over the whole sequence (or set), and repeat for all n items Full Input [words x in_channels] Full Output [words x out_channels]
  • 116. RESIDUAL NETWORKS Enabled training of really deep architectures (up to 152 layers) Each layer learns the residual functions with reference to the layer inputs Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
  • 117. TRANSFORMERS A transformer layer consists of a combination of self- attention layer and multiple fully-connected or convolutional layers, with residual connections A transformer-based encoder can consist of multiple transformers stacked in sequence Full Input [words x in_channels] Full Output [words x out_channels] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • 118. NORMALIZATION Internal covariate shift refers to the changing distribution of each layer’s inputs during training, as the parameters of the previous layers change BatchNorm and other normalization techniques achieve better training effectiveness by addressing this problem Sergey Ioffe, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Image source: https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/
  • 119. REGULARIZATION The process of adding information in order to prevent overfitting. Popular approaches: • Dropout • Regularization loss • Early stopping
  • 120. CONTEXTUALIZED DEEP WORD EMBEDDINGS http://jalammar.github.io/illustrated-bert/ Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
  • 121. BERT Stacked transformer layers Pretrained on two tasks: • Masked language modeling • Next sentence prediction Input: WordPiece embedding + position embedding + segment embedding Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
  • 122. LUNCH 60 MINS (13:00 PM - 14:00 PM)
  • 123. SHALLOW NEURAL METHODS FOR RANKING 50 MINS (14:00 PM - 14:50 PM)
  • 124. THERE IS A LONG HISTORY OF VECTOR SPACE MODELS (BOTH DENSE AND SPARSE) IN IR Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. In CACM, 1975. Scott Deerwester, et. al. Indexing by latent semantic analysis. In JASIST, 1990. Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
  • 125. RETRIEVAL USING VECTOR REPRESENTATIONS Generate vector representation of query Generate vector representation of document Estimate relevance from q-d vectors
  • 126. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
  • 127. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Compare query and document directly in the embedding space estimate relevance
  • 128. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance as follows, where, 𝑝 𝑡 𝑞|𝑑 is the probability of generating term 𝑡 𝑞 from document 𝑑
  • 129. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance as follows, 𝑝 𝑡 𝑞|𝑑 and 𝑝 𝑡 𝑞|𝐷 are the probabilities of randomly sampling term 𝑡 𝑞 from document 𝑑 and the full collection 𝐷, respectively 𝑝 𝑡 𝑞|𝐷 has a smoothing effect on the 𝑝 𝑡 𝑞|𝑑 estimation
  • 130. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
  • 133. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Probability of generating the term from the document based on similarity in the embedding space Probability of generating the term from the full collection based on similarity in the embedding space
  • 134. NEURAL TRANSLATION LANGUAGE MODEL Translation Language Model: Neural Translation Language Model: TLM estimates 𝑝 𝑡 𝑞|𝑡 𝑑 from q-d paired data similar to statistical machine translation NTLM uses term-term similarity in the embedding space to estimate 𝑝 𝑡 𝑞|𝑡 𝑑 Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
  • 135. AVERAGE TERM EMBEDDINGS Q-D relevance estimated by computing cosine similarity between centroid of q and d term embeddings Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 136. WORD MOVER’S DISTANCE Based on the Earth Mover’s Distance (EMD) [Rubner et al., 1998] Originally proposed by Wan et al. [2005, 2007], but used WordNet and topic categories Kusner et al. [2015] incorporated term embeddings Adapted for q-d matching by Guo et al. [2016] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In CV, 1998. Xiaojun Wan and Yuxin Peng. The earth mover’s distance as a semantic measure for document similarity. In CIKM, 2005. Xiaojun Wan. A novel document similarity measure based on earth mover’s distance. Information Sciences, 2007. Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, 2016.
  • 137.
  • 138. CHOICE OF TERM EMBEDDINGS FOR DOCUMENT RANKING RECAP: for the query “Albuquerque” the relevant document may contain terms like “population” and “area” Documents about “Santa Fe” not relevant for this query “Albuquerque” ↔ “population” (Topically similar) ✓ “Albuquerque” ↔ “Santa Fe” (Typically similar) ✗ Standard LSA and para2vec capture topical similarity, whereas w2v and GloVe capture a mix of both Top/Typ-ical Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
  • 139. DUAL EMBEDDING SPACE MODEL What if I told you that everyone using word2vec is throwing half the model away? Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 140. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. IN-OUT captures a more Topical notion of similarity than IN-IN and OUT-OUT Effect is exaggerated when embeddings are trained on short text (e.g., queries)
  • 141. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. Average term embeddings model, but use IN embeddings for query terms and OUT embeddings for document terms
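A numpy sketch of the DESM scoring idea described above: compare the IN embedding of each query term against the centroid of the document terms' OUT embeddings (the embedding matrices here are random placeholders for the trained word2vec IN/OUT matrices).

```python
import numpy as np

vocab = {"albuquerque": 0, "population": 1, "area": 2, "giraffe": 3}
dim = 50
IN = np.random.randn(len(vocab), dim)    # stand-in for the trained word2vec IN embeddings
OUT = np.random.randn(len(vocab), dim)   # stand-in for the trained OUT embeddings

def desm_score(query_terms, doc_terms):
    unit = lambda v: v / np.linalg.norm(v)
    # centroid of the (normalized) OUT embeddings of the document terms
    doc_centroid = unit(np.mean([unit(OUT[vocab[t]]) for t in doc_terms], axis=0))
    # average cosine between each query term (IN space) and the document centroid
    return np.mean([unit(IN[vocab[t]]) @ doc_centroid for t in query_terms])

score = desm_score(["albuquerque"], ["population", "area"])
```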
  • 142. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 143. IN+OUT Embeddings for 2.7M words trained on 600M+ Bing queries https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e Download
  • 144. RELEVANCE-BASED WORD EMBEDDING Goal: learn a word embedding that directly models a topical notion of similarity Given query q, predict term t sampled from a smoothed language model (estimated using PRF) for the query Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In SIGIR, 2017.
  • 145. A TALE OF TWO QUERIES “PEKAROVIC LAND COMPANY” Hard to learn good representation for the rare term pekarovic But easy to estimate relevance based on count of exact term matches of pekarovic in the document “WHAT CHANNEL ARE THE SEAHAWKS ON TODAY” Target document likely contains ESPN or sky sports instead of channel The terms ESPN and channel can be compared in a term embedding space Matching in the term space is necessary to handle rare terms. Matching in the latent embedding space can provide additional evidence of relevance. Best performance is often achieved by combining matching in both vector spaces.
  • 146. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) Besides the term “Cambridge”, other related terms (e.g., “university”, “town”, “population”, and “England”) contribute to the relevance of the passage Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 147. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) However, the same terms may also make a passage about Oxford look somewhat relevant to the query “Cambridge” Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 148. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) A passage about giraffes, however, obviously looks non-relevant in the embedding space… Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 149. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) But the embedding based matching model is more robust to the same passage when “giraffe” is replaced by “Cambridge”—a trick that would fool exact term based IR models. In a sense, the embedding based model ranks this passage low because Cambridge is not "an African even- toed ungulate mammal“. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 150. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014. Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Hamed Zamani and W Bruce Croft. Estimating embedding vectors for queries. In ICTIR, 2016. Compare query and document directly in the embedding space estimate relevance
  • 151. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
  • 152. QUERY EXPANSION USING TERM EMBEDDINGS Use embeddings to generate suitable query expansions estimate relevance Find good expansion terms based on nearness in the embedding space Better retrieval performance when combined with pseudo-relevance feedback (PRF) [Zamani and Croft, 2016] and if we learn query specific term embeddings [Diaz et al., 2016] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016. Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016. Hamed Zamani and W Bruce Croft. Embedding-based query language models. In ICTIR, 2016.
  • 153. QUERY EXPANSION Language-model-based IR: score(d, q) = KL(θ_q, θ_d), where θ_q is the query language model and θ_d is the document language model. Query expansion computes an expanded query model θ̂_q = U·Uᵀ·θ_q. For query expansion using PRF, U is the |v| × |D| term-document matrix; for query expansion using word embeddings, U is the |v| × k term embedding matrix.
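A numpy sketch of the expansion step: projecting the query language model through the term matrix U and back yields weights over all vocabulary terms, from which expansion terms can be picked (U, the vocabulary size, and the query below are placeholders).

```python
import numpy as np

V, k = 5000, 200
U = np.random.randn(V, k)          # |v| x k term embedding matrix (placeholder)
theta_q = np.zeros(V)              # query language model over the vocabulary
theta_q[[10, 42]] = 0.5            # e.g., a two-term query

theta_q_expanded = U @ (U.T @ theta_q)                  # expanded query model (unnormalized)
expansion_terms = np.argsort(-theta_q_expanded)[:20]    # top candidate expansion terms
```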
  • 154. “Word2vec is the sriracha sauce of deep learning!”
  • 155. BUT A GOOD CHEF… Would prepare the appropriate sauce for each dish.
  • 156. GLOBAL VS. LOCAL EMBEDDINGS Local Global tax cutting deficit squeeze vote reduce budget slash reduction reduction house spend bill lower plan halve spend soften billion freeze Nearest neighbors of the word “cut” (as in “gas cut”) in the embedding space Uglobal  embedding trained with P(w|C) Ulocal  embedding trained with P(w|R) Where, C is the whole document corpus R is the set of relevant documents only
  • 157. QUERY EXPANSION USING GLOBAL AND LOCAL WORD EMBEDDINGS Each point represents a candidate expansion term Red points have high frequency in the relevant set of documents White points have low or no frequency in the relevant set of documents The blue point represents the query. Contours indicate distance from the query global local Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
  • 158. DEEP NEURAL METHODS FOR RANKING 90 MINS (14:50 PM - 16:20 PM)
  • 159. SEMANTIC HASHING Document autoencoder minimizing reconstruction error Input: word counts (vocab size = 2K) Output: binary vector Stacked RBMs w/ layer-by-layer pre- training followed by E2E tuning Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
  • 160. DEEP SEMANTIC SIMILARITY MODEL (DSSM) Siamese network trained E2E on query and document title pairs Relevance is estimated by cosine similarity between query and document embeddings Input: character trigraph counts (bag of words assumption) Minimizes cross-entropy loss against randomly sampled negative documents Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
  • 161. CONVOLUTIONAL DSSM (CDSSM) Replace bag-of-words assumption by concatenating term vectors in a sequence on the input Convolution followed by global max-pooling Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
  • 162. REMEMBER… …how different embedding spaces capture different notions of similarity?
  • 163. DSSM TRAINED ON DIFFERENT TYPES OF DATA (trained on pairs of… | sample training data | useful for? | paper)
Query and document titles | <“things to do in seattle”, “seattle tourist attractions”> | Document ranking | (Shen et al., 2014) https://dl.acm.org/citation...
Query prefix and suffix | <“things to do in”, “seattle”> | Query auto-completion | (Mitra and Craswell, 2015) https://dl.acm.org/citation...
Consecutive queries in user sessions | <“things to do in seattle”, “space needle”> | Next query suggestion | (Mitra, 2015) https://dl.acm.org/citation...
Each model captures a different notion of similarity (or regularity) in the learnt embedding space. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015. Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 164. Nearest neighbors for “seattle” and “taylor swift” based on two DSSM models – one trained on query-document pairs and the other trained on query prefix-suffix pairs DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
  • 165. DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Groups of similar search intent transitions from a query log The DSSM trained on session query pairs can capture regularities in the query space (similar to word2vec for terms) Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 166. DSSM TRAINED ON SESSION QUERY PAIRS ALLOWS FOR ANALOGIES OVER SHORT TEXT! Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 167. INTERACTION-BASED NETWORKS Typically a document is relevant if some part of the document contains information relevant to the query Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing the ith window over query terms with the jth window over the document terms—captures evidence of relevance from different parts of the document Additional neural network layers can inspect the interaction matrix and aggregate the evidence to estimate overall relevance Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
  • 168. REMEMBER… …the importance of incorporating exact term matches as well as matches in the latent space for estimating relevance?
  • 169. LEXICAL AND SEMANTIC MATCHING NETWORKS Mitra et al. [2016] argue that both lexical and semantic matching is important for document ranking Duet model is a linear combination of two DNNs—focusing on lexical and semantic matching, respectively—jointly trained on labelled data Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 170. LEXICAL AND SEMANTIC MATCHING NETWORKS The lexical sub-model operates over a binary input matrix X, where x_{i,j} = 1 if t_{q,i} = t_{d,j} and 0 otherwise. In relevant documents: 1. Many matches, typically in clusters 2. Matches localized early in the document 3. Matches for all query terms 4. In-order (phrasal) matches. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
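A sketch of the lexical sub-model's input: a binary query-term × document-term match matrix that downstream convolutional layers can inspect for the patterns listed above (purely illustrative, not the Duet code).

```python
import numpy as np

def exact_match_matrix(query_terms, doc_terms):
    """X[i, j] = 1 if the i-th query term equals the j-th document term, else 0."""
    X = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, tq in enumerate(query_terms):
        for j, td in enumerate(doc_terms):
            if tq == td:
                X[i, j] = 1.0
    return X

X = exact_match_matrix("cambridge university".split(),
                       "the university of cambridge is a university in england".split())
```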
  • 171. LEXICAL AND SEMANTIC MATCHING NETWORKS Convolve using window of size 𝑛 𝑑 × 1 Each window instance compares a query term w/ whole document Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 172. LEXICAL AND SEMANTIC MATCHING NETWORKS Semantic sub-model matches in the latent embedding space Match query with moving windows over document Learn text embeddings specifically for the task Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 173. BIG VS. SMALL DATA REGIMES Big data seems to be more crucial for models that focus on good representation learning for text Partial supervision strategies (e.g., unsupervised pre-training of word embeddings) can be effective but may be leaving the bigger gains on the table Learning to train on unlabeled data may be key to making progress on neural ad-hoc retrieval Which IR models are similar? Clustering based on query level retrieval performance. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 174. Duet implementation on PyTorch https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb GET THE CODE
  • 175. WIDE AND DEEP MODEL Deep model for representation learning and wide model for memorization Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
  • 176. KERNEL POOLING Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
  • 177. WEB DOCUMENTS ARE MORE THAN JUST BODY TEXT… URL incoming anchor text title body clicked query
  • 178. EXTENDING NEURAL RANKING MODELS TO MULTIPLE DOCUMENT FIELDS BM25 → BM25F; neural ranking model → ?
  • 179. RANKING DOCUMENTS WITH MULTIPLE FIELDS Learn different embedding space for each document field Different fields may match different aspects of the query—learn different query embeddings for matching against different fields Represent per field match by a vector, not a score Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 180. Learn a different embedding space for each document field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 181. For multiple-instance fields, average pool the instance level embeddings Mask empty text instances, and average only among non-empty instances to avoid preferring documents with more instances Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 182. Learn different query embeddings for matching against different fields Different fields may match different aspects of the query Ideal query representation for matching against URL likely to be different from for matching with title Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 183. Represent per field match by a vector, not a score Allows the model to validate that across the different fields all aspects of the query intent have been covered (Similar intuition as BM25F) Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 184. Aggregate evidence of relevance across all document fields Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 185. High precision fields, such as clicked queries, can negatively impact the modeling of the other fields Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 186. MANY OTHER NEURAL ARCHITECTURES (Palangi et al., 2015) (Kalchbrenner et al., 2014) (Denil et al., 2014) (Kim, 2014) (Severyn and Moschitti, 2015) (Zhao et al., 2015) (Hu et al., 2014) (Tai et al., 2015) (Guo et al., 2016) (Hui et al., 2017) (Pang et al., 2017) (Jaech et al., 2017) (Dehghani et al., 2017)
  • 187. BERT FOR RANKING BERT (and other large-scale unsupervised language models) are demonstrating dramatic performance improvements on many IR tasks Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019. MS MARCO Query Passage Pair Query Passage score
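A sketch of BERT-style passage re-ranking using the Hugging Face transformers library; the checkpoint name is an assumption (any cross-encoder fine-tuned on MS MARCO-style relevance labels with a single relevance logit would do), and this is not the exact pipeline from the cited paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# assumed checkpoint: a cross-encoder fine-tuned for passage relevance
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

query = "what channel are the seahawks on today"
passages = ["The Seahawks game airs on ESPN at 8pm.",
            "Giraffes are even-toed ungulate mammals."]

# encode each (query, passage) pair jointly and score it with the classification head
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)   # assumes a single relevance logit
reranked = [p for _, p in sorted(zip(scores.tolist(), passages), reverse=True)]
```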
  • 188. Impact across both academia and industry BERT FOR RANKING
  • 189. WHAT DID YOUR MODEL REALLY LEARN? While we celebrate the recent performance bumps on IR tasks from neural methods, it is also important to recognize when and how they fail
  • 190. Clever Hans was a horse claimed to have been capable of performing arithmetic and other intellectual tasks. "If the eighth day of the month comes on a Tuesday, what is the date of the following Friday?“ Hans would answer by tapping his hoof. In fact, the horse was purported to have been responding directly to involuntary cues in the body language of the human trainer, who had the faculties to solve each problem. The trainer was entirely unaware that he was providing such cues. (source: Wikipedia)
  • 191. BM25 vs. BERT: BM25 depends on the inverse document frequency of terms, while BERT depends on a language model of term co-occurrences. What corpus statistics does your model depend on?
  • 192. WHAT CHANGED BETWEEN TRAIN AND TEST? Terms often change meaning across domains or over time. Robust retrieval performance is important (e.g., enterprise search across multiple tenants). Example—query: “uk prime minister”; the correct answer differs in older (1990s) TREC data, in recent data, and today.
  • 193. domain A domain B domain C domain X training domains test domain OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
  • 194. Baseline model projects query and document to latent space for matching Additional fully-connected layers to estimate relevance Hidden layers may encode domain specific statistics convolution and pooling layers convolution and pooling layers hadamard product dense layers 𝑦 query doc How do we encourage the model to only learn features that generalize across multiple domains? OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 195. OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Train model on multiple domains During training, an adversarial discriminator inspects the hidden states of the model and tries to predict the source corpus of the training sample convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc The duet model, in addition to optimizing for the ranking loss, also tries to “fool” the adversarial discriminator – and in the process learns more domain independent representations Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 196. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 197. query relevant document non-relevant document parameters of the adversarial discriminator parameters of the ranking model Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
  • 198. Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
  • 199. Reverse the gradient from the discriminator when back-propagating through the ranking model convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc ≈ ≈ Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. GRADIENT REVERSAL
  • 200. Adversarial regularization may also be useful for mitigating bias
  • 201. MARRYING THE OLD WITH THE NEW
  • 202. (SIGIR ’94) (SIGIR ’04) (SIGIR ’05)
  • 205. CONNECTION TO NEURAL RANKER TRAINING Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
  • 206. AXIOMATIC REGULARIZATION FOR NEURAL RANKER Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
  • 207. BREAK 10 MINS (16:20 PM - 16:30 PM)
  • 209. RETRIEVING, NOT JUST RERANKING, WITH DEEP NEURAL NETWORKS Deep ranking models are compute-intensive and are, in practice, employed only to rerank the top-k candidates retrieved by more efficient traditional IR methods. Their impact on IR performance could be significantly larger if we could also use them for candidate generation.
  • 210. OPTION 1: QUERY-INDEPENDENT DOCUMENT REPRESENTATION Employ a Siamese network architecture: compute document representations offline and the query representation at inference time. Efficient online, but large offline computation cost; effectiveness degrades without interaction features and lexical term matching.
  • 211. APPROXIMATE K-NN SEARCH Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let's replace term-based retrieval with k-nn search. In CIKM, 2016.
  • 212. LEARNING SPARSE VECTOR REPRESENTATIONS Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009. Hamed Zamani, et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM, 2018.
  • 213. FAST APPROX. K-NN SEARCH WITH ANNOY https://github.com/spotify/annoy
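A minimal Annoy usage sketch (dimensions and data are placeholders): build an index over document vectors offline, then query it with a query vector at run time.

```python
import numpy as np
from annoy import AnnoyIndex

dim = 200
index = AnnoyIndex(dim, "angular")           # angular distance ~ cosine similarity

doc_vectors = np.random.randn(10000, dim)    # placeholder document embeddings
for doc_id, vec in enumerate(doc_vectors):
    index.add_item(doc_id, vec.tolist())
index.build(50)                              # 50 trees: more trees -> better recall, bigger index
index.save("docs.ann")                       # the saved index can be memory-mapped by query servers

query_vec = np.random.randn(dim)
top_doc_ids = index.get_nns_by_vector(query_vec.tolist(), 10)   # approximate top-10 neighbours
```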
  • 214. OPTION 2: ASSUME QUERY TERM INDEPENDENCE Efficient online but large offline computation cost; can scale to tail queries at higher computation cost—we can trade off the two experimentally. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 215. WE TYPICALLY LEARN THE PARAMETERS OF THE MODEL BY MINIMIZING SOME PAIRWISE LOSS… e.g., Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 216. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 217. TERM-DOCUMENT IMPACT SCORES d1 d2 d3 d4 d5 t1 t2 t3 t4 t5 If the IR model assumes query term independence, we can precompute all term-document impact scores The matrix is generally sparse, either by definition or by enforcing additional sparsity constraints (e.g., assume only terms that appear in the document have non-zero impact) Precomputed scores can be used with inverted index for fast retrieval from large collections Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
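A minimal sketch of how precomputed impact scores can back an inverted index (the `corpus`, `doc.terms`, and `model_term_score` names are placeholders): the offline pass stores sparse per-term impacts, and the online pass simply sums them over the query terms, which is exactly what the query term independence assumption permits.

```python
from collections import defaultdict

impact = defaultdict(list)                       # term -> [(doc_id, score), ...]

# Offline: score every (term, document) pair for terms that appear in the document.
for doc_id, doc in enumerate(corpus):
    for term in set(doc.terms):                  # sparsity: only terms present in the doc
        impact[term].append((doc_id, model_term_score(term, doc)))


def retrieve(query_terms, k=10):
    # Online: sum the precomputed per-term impacts over the query terms.
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, s in impact.get(term, []):
            scores[doc_id] += s
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```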
  • 218. NEURAL RANKING MODEL WITH QUERY TERM INDEPENDENCE ASSUMPTION score Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 219. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 220. THE EFFECTIVENESS-EFFICIENCY TRADE-OFF The model does not have the context of the full query which may result in reduced effectiveness However, now we can precompute everything and use the learned model in a full retrieval setting! Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 221. HOW SERIOUS IS THE LOSS IN EFFECTIVENESS FROM ASSUMING QUERY TERM INDEPENDENCE? Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019. Reranking evaluation Full retrieval evaluation (on a smaller set of queries than previous table)
  • 222. DOCUMENT EXPANSION Similar to the query term independence approach in that both try to learn a better document language model. The training objective here, however, is to predict relevant queries rather than the target ranking metric we care about. Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document Expansion by Query Prediction. In arXiv, 2019.
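A hedged sketch of document expansion in this spirit, with a hypothetical `query_generator` standing in for the trained sequence-to-sequence model: predicted queries are appended to the document text before building a standard term-based index.

```python
def expand(document_text, n_queries=5):
    # query_generator is assumed to return n candidate queries for the document.
    predicted_queries = query_generator(document_text, n_queries)
    return document_text + " " + " ".join(predicted_queries)


expanded_corpus = [expand(doc) for doc in corpus]
# Index expanded_corpus with a traditional retriever (e.g., BM25) as usual.
```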
  • 223. TRADING-OFF SEARCH RESULT QUALITY AND QUERY RESPONSE TIME IN LARGE SCALE IR SYSTEMS In Bing, we have a candidate generation stage followed by multiple rank and prune stages Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 224. In Bing, the index is distributed over multiple machines For candidate generation, on each machine the documents are linearly scanned using a match plan Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 225. When a query comes in, it is automatically categorized and a pre-defined match plan is selected. A match plan consists of a sequence of match rules and their corresponding stopping criteria. A match rule defines the condition that a document must satisfy to be selected as a candidate. The stopping criterion decides when the index scan using a particular match rule should terminate, and whether the matching process should then continue with the next match rule, conclude, or reset to the beginning of the index. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 226. Match plans influence the trade-off between effectiveness and efficiency E.g., long queries with rare intents may require expensive match plans that consider body text and search deeper into the index In contrast, for popular navigational queries a shallow scan against URL and title metastreams may be sufficient Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 227. E.g., Query: halloween costumes Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T ) Query: facebook login Match rule: mrB → (facebook ∈ U|T ) Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 228. During execution, two accumulators are tracked: u, the number of blocks accessed from disk, and v, the cumulative number of term matches in all inspected documents. A stopping criterion sets thresholds for each; when either threshold is met, the scan using that particular match rule terminates. Matching may then continue with a new match rule, terminate, or restart from the beginning. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
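A schematic sketch of match plan execution (not Bing's implementation; `match_plan` is assumed to be a list of (rule, u_threshold, v_threshold) entries and each rule a predicate returning the number of matching terms in a document):

```python
def run_match_plan(match_plan, index_blocks):
    candidates = []
    u = v = 0                                    # accumulators: blocks accessed, term matches
    for rule, u_max, v_max in match_plan:        # match rules in order
        for block in index_blocks:
            u += 1                               # one more block read from disk
            for doc in block:
                n_matches = rule(doc)            # e.g., query terms found in U/T/B/A streams
                v += n_matches
                if n_matches:
                    candidates.append(doc)
            if u >= u_max or v >= v_max:         # this rule's stopping criterion is met
                break                            # here: simply move on to the next match rule
    return candidates
```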
  • 229. TYPICALLY THESE MATCH PLANS ARE HAND- CRAFTED AND STATICALLY ASSIGNED TO DIFFERENT QUERY CATEGORIES WE CAN CAST THE MATCH PLANNING AS A REINFORCEMENT LEARNING TASK! Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 230. REINFORCEMENT LEARNING environment action reward agent state Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 231. (for Bing candidate generation) index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 232. (for Bing candidate generation) Learn a policy πθ : S → A which maximizes the cumulative discounted reward R Where, γ is the discount rate index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
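The expression itself is not reproduced in this text export; the standard form of the cumulative discounted reward described on the slide is:

```latex
R = \sum_{t=0}^{T} \gamma^{t} \, r_{t}, \qquad 0 \le \gamma \le 1
```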
  • 233. (for Bing candidate generation) We use table-based Q-learning. State space: discrete ⟨u_t, v_t⟩. Action space: the set of match rules. index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
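A generic sketch of table-based Q-learning over discretized accumulator states, with match rules as actions; the number of rules, bucket size, learning rate, discount, and epsilon below are illustrative values, not the ones used in the paper.

```python
import random
from collections import defaultdict

N_MATCH_RULES = 4                                # illustrative: one action per match rule
ACTIONS = list(range(N_MATCH_RULES))
Q = defaultdict(float)                           # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.1            # learning rate, discount, exploration


def discretize(u, v, bucket=1000):
    """Map the raw accumulators to a discrete state."""
    return (u // bucket, v // bucket)


def choose_action(state):
    if random.random() < epsilon:                # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```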
  • 234. (for Bing candidate generation) Reward function: g(di) is the relevance of the ith document estimated based on the subsequent L1 ranker score— considering only top n documents index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 235. (for Bing candidate generation) Final reward: If no new documents are selected, we assign a small negative reward index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 236. RESULTS Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 237. DEEP LEARNING @ TREC 15 MINS (5:15 PM - 5:30 PM)
  • 238. GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA
Past: proprietary data. 200K queries, human-labeled, proprietary. Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017.
Past: weak supervision. 1+M queries, weak supervision, open. Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017.
Here: two new datasets. 300+K queries, human-labeled, open. TREC 2019 Deep Learning Track.
(Chart axes: more data vs. better search results.)
  • 239. GENERATING PUBLIC BENCHMARKS FOR NEURAL IR RESEARCH A public retrieval and ranking benchmark with large scale training data (~400K queries with manual relevance labels)
  • 240. DERIVING OUR TREC 2019 DATASETS MS MARCO QnA Leaderboard • 1M real queries • 10 passages per Q • Human annotation says ~1 of 10 answers the query MS MARCO Passage Retrieval Leaderboard • Corpus: Union of 10-passage sets • Labels: From the ~1 positive passage TREC 2019 Task: Passage Retrieval • Same corpus, training Q+labels • New reusable NIST test set TREC 2019 Task: Document Retrieval • Corpus: Documents (crawl passage urls) • Labels: Transfer from passage to doc • New reusable NIST test set http://msmarco.org https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 241. SETUP OF THE 2019 DEEP LEARNING TRACK • Key question: What works best in a large-data regime? • “nnlm”: Runs that use a BERT-style language model • “nn”: Runs that do representation learning • “trad”: Runs using only traditional IR features (such as BM25 and RM3) • Subtasks: • “fullrank”: End-to-end retrieval • “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
Task | Training data | Test data | Corpus
1) Document retrieval | 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents
2) Passage retrieval | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
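A hedged sketch of reading MS MARCO-style training triples; the file name and the tab-separated (query, positive passage, negative passage) layout are assumptions about the released data, so check the track site for the authoritative formats.

```python
def read_triples(path="triples.train.small.tsv"):   # file name is an assumption
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative
```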
  • 242. DATASET AVAILABILITY • Corpus + train + dev data for both tasks available now from the DL Track site* • NIST test sets available to participants now • [Broader availability in Feb 2020] * https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 243. SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
  • 244.
  • 245. Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com THANK YOU

Editor's Notes

  1. Local representation | Distributed representation
One dimension for “banana” | “banana” is a pattern
Brittle under noise | More robust to noise
Precise | Near “mango”, “pineapple” (nuanced)
Add vocab → add dimensions | Add vocab → generate more vectors
K dimensions → K items | K dimensions → 2^K “regions”
  2. Clever Hans was a horse. It was claimed that he could do simple arithmetic. If you asked Hans a question, he would respond by tapping his hoof. After a thorough investigation it was determined, however, that what Clever Hans was really good at was reading very subtle and, in fact, unintentional cues that his trainer was giving him via his body language. Hans didn’t know arithmetic at all. But he was very good at spotting body language that CORRELATED highly with the right answer.
  3. A traditional IR model, such as BM25, makes very few assumptions about the target collection. You could argue that the inverse document frequencies (and a couple of BM25 hyper-parameters) are all that you learn from your collection. That is why you can throw BM25 at most retrieval tasks (e.g., TREC or web ranking in Bing) and it will give you reasonably good performance out-of-the-box in most cases. On the other hand, take a deep neural model, train it on the Bing web ranking task, and then evaluate it on TREC data, and I bet it falls flat on its face.
  4. This is an important problem. Think about an enterprise search solution that needs to cater to a large number of tenants. You train your model on only a few tenants, either because of privacy constraints or because most tenants are too small to provide enough training data. But afterwards you need to deploy the same model to all the tenants. Good cross-domain performance would be key in such a setting. How can we make these deep and large machine learning models, with all their lovely bells and whistles, as robust as a simple BM25 baseline?