Deep Learning for Search
(Alternative title: “Neural Information Retrieval”)
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD Student, University College London
@UnderdogGeek bmitra@microsoft.com
Background image modified from source: https://commons.wikimedia.org/wiki/File:Howrah_Bridge_from_the_western_bank_of_the_Ganges.jpg
I am an applied researcher at Bing.
Based in Microsoft Research Montreal.
Previously worked for Microsoft in
Hyderabad, Seattle, and Cambridge.
Part-time PhD candidate at University
College London. My research is on
neural methods for information retrieval.
Born and raised in Kolkata.
(Diagram: works for Bing; based in MSR Montreal; used to be based in MSR Cambridge; doing PhD at UCL)
Neural Information
Retrieval (or neural IR)
is the application of
shallow or deep neural
networks to IR tasks.
THE STATE OF NEURAL INFORMATION RETRIEVAL
GROWING PUBLICATION POPULARITY
AT TOP IR CONFERENCES
STRONG PERFORMANCE AGAINST
TRADITIONAL METHODS IN TREC 2019
Download the slides:
http://bit.ly/dl4search-fire2019
Download the free book:
http://bit.ly/neuralir-intro
Download TREC Deep Learning Track data:
https://microsoft.github.io/TREC-2019-Deep-Learning/
@UnderdogGeek bmitra@microsoft.com
RESOURCES
AGENDA
Let’s focus on the fundamentals!
Please feel free to interrupt and
ask lots of questions!
THE SEARCH TASK
10 MINS
(10:05 AM - 10:15 AM)
INFORMATION RETRIEVAL (IR)
User has an information need
There exists a collection of information resources
IR is the activity of retrieving the information
resources relevant to the information need
EXAMPLE OF AN IR TASK
(WEB SEARCH)
User expresses information need as a short
textual query
The search engine retrieves top relevant web
documents as information resources
We will use web search as the main example of
an IR task in the rest of this lecture
query
Information
need
retrieval system indexes a
document corpus
results ranking (document list)
Relevance
(documents satisfy
information need)
DESIDERATA
Decades of IR research has
identified some key factors that text
retrieval models should consider
Traditional IR models typically
incorporate one or more of these
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
DESIDERATA
A document that contains more occurrences of the
query term(s) is more likely to be relevant
Tip: consider term frequency (TF)
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A rare term (e.g., “msmarco”) is likely to be more
informative than a common term (e.g., “and”)
Tip: consider inverse document frequency (IDF)
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
more informative than
DESIDERATA
A term should not contribute disproportionately
Increase in TF should have larger impact for smaller TFs
Tip: put a saturation function over the TF
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A document containing more non-relevant terms is
likely to be less relevant
Tip: perform document length normalization
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
A document containing query terms in close proximity is
likely to be more relevant than one where the terms
occur far away from each other
Tip: consider proximity features
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
Term matches earlier in the document may indicate that the
document is more likely to be relevant
Tip: consider position of term matches
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
preferable over
DESIDERATA
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
uk prime minister
The query and the document may refer to the same
concept using different vocabularies
Tip: consider expanding the query or document, or
matching query and document terms in a latent space
theresa may
DESIDERATA
Term frequency
Term weighting
Term saturation
Document length
Term proximity
Term position
Vocabulary mismatch
Term aboutness
albuquerque
By inspecting other terms in the document we may infer
if the document is about the query term
Tip: consider expanding the query or matching the
query terms with the document terms in a latent space
Passage about Albuquerque Passage not about Albuquerque
EXAMPLES OF RANKING METRICS
Discounted Cumulative Gain (DCG)
DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log_2(i + 1)
Reciprocal Rank (RR)
RR@k = max_{1 ≤ i ≤ k} rel_i / i
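A minimal sketch of both metrics; the graded relevance list rels is a toy input for illustration:

import math

def dcg_at_k(rels, k):
    # DCG@k = sum over the top-k positions of (2^rel_i - 1) / log2(i + 1)
    return sum((2 ** rel - 1) / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    # RR@k = max over the top-k positions of rel_i / i
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

print(dcg_at_k([3, 2, 0, 1], k=4), rr_at_k([0, 0, 1, 1], k=4))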
FUNDAMENTALS OF
NEURAL NETWORKS
30 MINS
(10:15 AM - 10:45 AM)
NEURAL
NETWORKS
Chains of parameterized linear transforms (e.g., multiply weight,
add bias) followed by non-linear functions (σ)
Popular choices for σ: Tanh, ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
(Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes the prediction, the backward pass propagates the loss computed against the expected output)
FUNDAMENTAL MACHINE LEARNING TASKS
SQUARED LOSS
The squared loss is a popular loss function for regression tasks
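In its standard form, for a target y and a prediction ŷ (typically averaged over the training samples):
ℒ_squared = (y − ŷ)²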
THE SOFTMAX FUNCTION
In neural classification models, the softmax function is popularly used to normalize
the neural network output scores across all the classes
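In its standard form, for output scores z_1, …, z_{|C|} over |C| classes:
p_i = e^{z_i} / Σ_{j=1}^{|C|} e^{z_j}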
CROSS ENTROPY
The cross entropy between two probability
distributions 𝑝 and 𝑞 over a discrete set of
events is given by,
If p_correct = 1 and p_i = 0 for all
other values of i, then
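The standard definitions for the general case and this one-hot case are:
H(p, q) = −Σ_i p_i · log(q_i)
H(p, q) = −log(q_correct)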
CROSS ENTROPY WITH
SOFTMAX LOSS
Cross entropy with softmax is a popular loss
function for classification
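Combining the two, in its standard form (with s_i denoting the model's score for class i):
ℒ_CE = −log( e^{s_correct} / Σ_i e^{s_i} )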
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = ∂l/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
STOCHASTIC GRADIENT DESCENT (SGD)
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = ∂(y − y2)²/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × ∂tanh(w2·y1 + b2)/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂y1/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂tanh(w1·x + b1)/∂w1
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized
Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1)
∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x
Update the parameter value based on the gradient with η as the learning rate
w1^new = w1^old − η × ∂l/∂w1
Task: regression
Training data: 𝑥, 𝑦 pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2
Network: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2); loss l = (y − y2)², where y is the target output
…and repeat
STOCHASTIC GRADIENT DESCENT (SGD)
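To make the walkthrough concrete, here is a minimal plain-Python sketch of the same toy network trained with SGD; the initial parameter values, learning rate, and synthetic target function are assumptions for illustration.

import math, random

# Toy network from the walkthrough:
#   y1 = tanh(w1*x + b1),  y2 = tanh(w2*y1 + b2),  loss l = (y - y2)^2
w1, b1, w2, b2 = 0.1, 0.0, 0.1, 0.0   # assumed initial values
eta = 0.1                             # assumed learning rate

def sgd_step(x, y):
    global w1, b1, w2, b2
    # forward pass
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    l = (y - y2) ** 2
    # backward pass (chain rule, as expanded above)
    dl_dy2 = -2 * (y - y2)
    dy2_dz2 = 1 - y2 ** 2             # derivative of tanh(w2*y1 + b2)
    dy1_dz1 = 1 - y1 ** 2             # derivative of tanh(w1*x + b1)
    dl_dw2 = dl_dy2 * dy2_dz2 * y1
    dl_db2 = dl_dy2 * dy2_dz2
    dl_dw1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1 * x   # matches the final expansion above
    dl_db1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1
    # parameter updates with learning rate eta
    w1 -= eta * dl_dw1; b1 -= eta * dl_db1
    w2 -= eta * dl_dw2; b2 -= eta * dl_db2
    return l

# ...and repeat over (x, y) training pairs
for _ in range(1000):
    x = random.uniform(-1, 1)
    sgd_step(x, math.tanh(0.5 * x))   # toy target function (assumption)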
COMPUTATION
NETWORKS
The “Lego” approach to specifying DNN architectures
Library of computation nodes, each node defines logic for:
1. Forward pass: compute output given input
2. Backward pass: compute gradient of loss w.r.t. inputs,
given gradient of loss w.r.t. outputs
3. Parameter gradient: compute gradient of loss w.r.t.
parameters, given gradient of loss w.r.t. outputs
Chain nodes to create bigger and more complex networks
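A minimal sketch of the idea (not any particular toolkit's API): each node caches what it needs in the forward pass and exposes a backward pass plus parameter gradients.

import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01
        self.b = np.zeros(out_dim)
    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out    # parameter gradients, given grad of loss w.r.t. outputs
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T       # gradient of loss w.r.t. inputs (passed to previous node)

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1 - self.y ** 2)

# chaining nodes builds a bigger network: forward left-to-right, backward right-to-left
nodes = [Linear(4, 8), Tanh(), Linear(8, 1)]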
TOOLKITS
A diverse set of options
to choose from!
Figure from https://towardsdatascience.com/battle-of-
the-deep-learning-frameworks-part-i-cff0e3841750
TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH
First, we define the model
architecture
Next, we specify loss function and
optimization algorithm
Finally, loop over training data to
optimize model parameters
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
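A condensed sketch in the spirit of the linked tutorial; the layer sizes follow the CIFAR-10 example and the trainloader data loader is assumed to be defined as in the tutorial.

import torch
import torch.nn as nn
import torch.nn.functional as F

# First, define the model architecture
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = Net()

# Next, specify the loss function and the optimization algorithm
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Finally, loop over the training data to optimize the model parameters
for epoch in range(2):
    for inputs, labels in trainloader:   # trainloader assumed defined as in the tutorial
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()                  # backpropagate
        optimizer.step()                 # update parameters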
REALLY DEEP
NEURAL NETWORKS
(Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
WHY ADDING DEPTH HELPS
http://playground.tensorflow.org
can’t separate using a linear model!
Input features (surface, kerberos, book, library) → Label
1 0 1 0 → ✓
1 1 0 0 → ✗
0 1 0 1 → ✓
0 0 1 1 → ✗
(Diagram: a tiny network with two hidden units, H1 and H2, connected to the inputs {surface, kerberos, book, library} with weights of +1 and −1, and to the output with weights of +0.5)
But let’s consider a tiny neural
network with one hidden layer…
VISUAL
MOTIVATION FOR
HIDDEN UNITS
Consider the following “toy” challenge for
classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
VISUAL
MOTIVATION FOR
HIDDEN UNITS
Or more succinctly…
Input features (surface, kerberos, book, library) | Hidden layer (H1, H2) → Label
1 0 1 0 | 1 0 → ✓
1 1 0 0 | 0 0 → ✗
0 1 0 1 | 0 1 → ✓
0 0 1 1 | 0 0 → ✗
(Diagram: a tiny network with two hidden units, H1 and H2, connected to the inputs {surface, kerberos, book, library} with weights of +1 and −1, and to the output with weights of +0.5)
But let’s consider a tiny neural
network with one hidden layer…
can separate using a linear model!
Consider the following “toy” challenge for
classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
WHY ADDING DEPTH HELPS
Deeper networks can split the input space
into many more (non-independent) linear regions
than shallow networks
Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks NIPS 2014
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
THE LOTTERY
TICKET HYPOTHESIS
BIAS-VARIANCE TRADE-OFF IN THE
DEEP LEARNING ERA
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
LEARNING TO RANK
35 MINS
(10:45 AM - 11:20 AM)
MOST IR SYSTEMS PRESENT
RANKED LISTS OF RETRIEVED
INFORMATION ARTIFACTS
THE UNREASONABLE EFFECTIVENESS
OF SIMPLE LTR BASED APPROACHES
LEARNING TO
RANK (LTR)
”... the task to automatically construct a ranking
model using training data, such that the model
can sort new objects according to their degrees
of relevance, preference, or importance.”
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
LEARNING TO
RANK (LTR)
L2R models represent a rankable item—e.g.,
a document—given some context—e.g., a
user-issued query—as a numerical vector
x ∈ ℝ^n
The ranking model 𝑓: 𝑥 → ℝ is trained to
map the vector to a real-valued score such
that relevant items are scored higher.
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
WHY IS RANKING CHALLENGING?
Rank based metrics, such as DCG or MRR, are
non-smooth / non-differentiable
APPROACHES
Pointwise approach
Relevance label 𝑦 𝑞,𝑑 is a number—derived from binary or graded human
judgments or implicit user feedback (e.g., CTR). Typically, a regression or
classification model is trained to predict 𝑦 𝑞,𝑑 given 𝑥 𝑞,𝑑.
Pairwise approach
Pairwise preference between documents for a query (𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞) as
label. Reduces to binary classification to predict more relevant document.
Listwise approach
Directly optimize for rank-based metric, such as NDCG—difficult because
these metrics are often not differentiable w.r.t. model parameters.
Liu [2009] categorizes
different LTR approaches
based on training objectives:
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
FEATURES
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
Traditional L2R models employ
hand-crafted features that
encode IR insights
FEATURES
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
POINTWISE
OBJECTIVES
Regression loss
Given 𝑞, 𝑑 predict the value of 𝑦 𝑞,𝑑
e.g., square loss for binary or categorical
labels,
where, 𝑦 𝑞,𝑑 is the one-hot representation
[Fuhr, 1989] or the actual value [Cossock and
Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
POINTWISE
OBJECTIVES
Classification loss
Given 𝑞, 𝑑 predict the class 𝑦 𝑞,𝑑
e.g., cross-entropy with softmax over
categorical labels 𝑌 [Li et al., 2008],
where, 𝑠 𝑦 𝑞,𝑑
is the model’s score for label 𝑦 𝑞,𝑑
Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
PAIRWISE
OBJECTIVES Pairwise loss generally has the following form [Chen et al., 2009],
where, 𝜙 can be,
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^{−z} [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^{−z}) [Burges et al., 2005]
• Others…
Pairwise loss minimizes the average number of
inversions in ranking—i.e., 𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞 but 𝑑𝑗 is
ranked higher than 𝑑𝑖
Given 𝑞, 𝑑𝑖, 𝑑𝑗 , predict the more relevant document
For 𝑞, 𝑑𝑖 and 𝑞, 𝑑𝑗 ,
Feature vectors: 𝑥𝑖 and 𝑥𝑗
Model scores: 𝑠𝑖 = 𝑓 𝑥𝑖 and 𝑠𝑗 = 𝑓 𝑥𝑗
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
PAIRWISE
OBJECTIVES
RankNet loss
Pairwise loss function proposed by Burges et al. [2005]—an industry favourite
[Burges, 2015]
Predicted probabilities: p̂_ij = p(s_i > s_j) ≡ e^{γ·s_i} / (e^{γ·s_i} + e^{γ·s_j}) = 1 / (1 + e^{−γ·(s_i − s_j)})
Desired probabilities: p_ij = 1 and p_ji = 0
Computing cross-entropy between p and p̂,
ℒ_RankNet = −p_ij·log(p̂_ij) − p_ji·log(p̂_ji) = −log(p̂_ij) = log(1 + e^{−γ·(s_i − s_j)})
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
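A minimal sketch of the RankNet loss (γ = 1 by default; the score tensors here are toy values, and the model producing s_i and s_j is assumed):

import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    # -log p̂_ij = log(1 + exp(-gamma * (s_i - s_j))), i.e. softplus of the negated score gap
    return F.softplus(-gamma * (s_i - s_j))

# usage: s_i and s_j are model scores for the more and the less relevant document
s_i = torch.tensor([2.1]); s_j = torch.tensor([0.3])
loss = ranknet_loss(s_i, s_j).mean()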
A GENERALIZED CROSS-ENTROPY LOSS
An alternative loss function assumes a single relevant document 𝑑+ and compares it
against the full collection 𝐷
Predicted probabilities: p(d+ | q) = e^{γ·s(q, d+)} / Σ_{d∈D} e^{γ·s(q, d)}
The cross-entropy loss is then given by,
ℒ_CE(q, d+, D) = −log(p(d+ | q)) = −log( e^{γ·s(q, d+)} / Σ_{d∈D} e^{γ·s(q, d)} )
Computing the softmax over the full collection is prohibitively expensive—LTR models
typically consider few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
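A minimal sketch of this loss with a handful of sampled negatives; the batch size, number of negatives, and random scores are placeholders for model outputs:

import torch
import torch.nn.functional as F

def ce_with_sampled_negatives(scores):
    # scores: [batch, 1 + num_negatives]; column 0 holds the score of the relevant document d+
    labels = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores, labels)   # equals -log softmax(scores)[:, 0], averaged over the batch

scores = torch.randn(8, 5)                   # 8 queries, 1 positive + 4 sampled negatives each
loss = ce_with_sampled_negatives(scores)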
Blue: relevant Gray: non-relevant
NDCG and ERR are higher for the left ranking,
but the right ranking has fewer pairwise errors
Due to strong position-based discounting in
IR measures, errors at higher ranks are much
more problematic than at lower ranks
But listwise metrics are non-continuous and
non-differentiable
LISTWISE
OBJECTIVES
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
[Burges, 2010]
LISTWISE
OBJECTIVES
Burges et al. [2006] make two observations:
1. To train a model we don’t need the costs
themselves, only the gradients (of the costs
w.r.t model scores)
2. It is desirable that the gradient be bigger for
pairs of documents whose swap produces a bigger
change in NDCG
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
LambdaRank loss
Multiply actual gradients with the change in
NDCG by swapping the rank positions of the
two documents
LISTWISE
OBJECTIVES
According to the Luce model [Luce, 2005], given
four items 𝑑1, 𝑑2, 𝑑3, 𝑑4 the probability of
observing a particular rank-order, say
𝑑2, 𝑑1, 𝑑4, 𝑑3 , is given by:
where, 𝜋 is a particular permutation and 𝜙 is a
transformation (e.g., linear, exponential, or
sigmoid) over the score 𝑠𝑖 corresponding to item
𝑑𝑖
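Written out in its standard form, the probability for this example is the product:
p(⟨d_2, d_1, d_4, d_3⟩) = φ(s_2) / (φ(s_1) + φ(s_2) + φ(s_3) + φ(s_4)) × φ(s_1) / (φ(s_1) + φ(s_3) + φ(s_4)) × φ(s_4) / (φ(s_3) + φ(s_4))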
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
ListNet loss
Cao et al. [2007] propose to compute the
probability distribution over all possible
permutations based on model score and ground-
truth labels. The loss is then given by the K-L
divergence between these two distributions.
This is computationally very costly, computing
permutations of only the top-K items makes it
slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the
probability of the ideal permutation based on the
ground truth. However, with categorical labels
more than one permutation is possible.
LISTWISE
OBJECTIVES
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of
documents as a function of their scores
This “smooth” rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
BREAK
10 MINS
(11:20 AM - 11:30 AM)
EMBEDDINGS
45 MINS
(11:30 AM - 12:15 PM)
TYPES OF VECTOR REPRESENTATIONS
Local (or one-hot) representation
Every term in vocabulary T is represented by a
binary vector of length |T|, where one position in
the vector is set to one and the rest to zero
Distributed representation
Every term in vocabulary T is represented by a
real-valued vector of length k. The vector can be
sparse or dense. The vector dimensions may be
observed (e.g., hand-crafted features) or latent
(e.g., embedding dimensions).
Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
OBSERVED (OR EXPLICIT)
DISTRIBUTED
REPRESENTATIONS
The choice of features is a key consideration
The distributional hypothesis states that
terms that are used (or occur) in similar
context tend to be semantically similar
[Harris, 1954]
Firth [1957] famously purported this idea of
distributional semantics by stating “a word is
characterized by the company it keeps”.
Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford.
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
MINOR NOTE: SPOT THE DIFFERENCE!
DISTRIBUTED REPRESENTATION
Vector representations of items as
combinations of different features
or dimensions (as opposed to
one-hot)
DISTRIBUTIONAL SEMANTICS
Linguistic items with similar
distributions (e.g. context words)
have similar meanings
http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
EXAMPLE: TERM-CONTEXT VECTOR SPACE
T: vocabulary, C: set of contexts, S: sparse matrix |T| x |C|
(PPMI: Positive Pointwise Mutual Information)
(Matrix S: entry S_ij is the PPMI value for term t_i and context c_j)
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010
EXAMPLE: SALTON’S VECTOR SPACE
D: collection, T: vocabulary, S: sparse matrix |D| x |T|
(Matrix S: entry S_ij is the TF-IDF weight of term t_j in document d_i)
G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, Nov. 1975
NOTIONS OF
SIMILARITY
Two terms are similar if their feature
vectors are close
But different feature spaces may capture
different notions of similarity
Is Seattle more similar to…
Sydney (similar type)
or
Seahawks (similar topic)
Depends on your choice of features
NOTIONS OF
SIMILARITY
Consider the following toy corpus…
Now consider the different vector
representations of terms you can derive
from this corpus and how the items that
are similar differ in these vector spaces
NOTIONS OF
SIMILARITY
Topical or Syntagmatic similarity
NOTIONS OF
SIMILARITY
Typical or Paradigmatic similarity
NOTIONS OF
SIMILARITY
A mix of Topical and Typical similarity
NOTIONS OF
SIMILARITY
Consider the following toy corpus…
Now consider the different vector
representations of terms you can derive
from this corpus and how the items that
are similar differ in these vector spaces
RETRIEVAL USING VECTOR REPRESENTATIONS
Map both query and candidate documents
into the same vector space
Retrieve documents closest to the query
e.g., using Salton’s vector space model
Where, 𝑣 𝑞 and 𝑣 𝑑 are vectors of TF-IDF scores
over all terms in the vocabulary
G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, Nov. 1975
sim(q, d) = (v_q · v_d) / (‖v_q‖ · ‖v_d‖)
REGULARITIES IN OBSERVED FEATURE SPACES
Some feature spaces capture
interesting linguistic regularities
e.g., simple vector algebra in the
term-neighboring term space may
be useful for word analogy tasks
Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
EMBEDDINGS
An embedding is a representation of items in
a new space such that the properties of, and
the relationships between, the items are
preserved from the original representation.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
EMBEDDINGS
e.g., 200-dimensional term embedding for “banana”
EMBEDDINGS
Compared to observed feature spaces:
• Embeddings typically have fewer dimensions
• The space may have more disentangled principle
components
• The dimensions may be less interpretable
• The latent representations may generalize better
What’s the advantage of
latent vector spaces over
observed features spaces?
LET’S TAKE AN IR
EXAMPLE
In Salton’s vector space, both
these passages are equidistant
from the query “Albuquerque”
A latent feature representation
may put the first passage closer to
the query because of terms like
“population” and “area”
Passage about Albuquerque
Passage not about Albuquerque
Query: “Albuquerque”
HOW TO LEARN TERM EMBEDDINGS?
Multiple approaches have been
proposed for learning embeddings
from <term, context, count> data
Popular approaches include matrix
factorization or stochastic gradient
descent (SGD)
(Term-context matrix X: entry X_ij is the count for term t_i and context c_j)
LATENT SEMANTIC ANALYSIS (LSA)
Perform SVD on X to obtain
its low-rank approximation
Involves finding a solution to X = UΣVᵀ
The embedding for the i-th term is given by Σ_k·t_i
Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990.
WORD2VEC
Goal: simple (shallow) neural model
learning from billion words scale corpus
Predict middle word from neighbors
within a fixed size context window
Two different architectures:
1. Skip-gram
2. CBOW
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
SKIP-GRAM
Predict neighbor 𝑡𝑖+𝑗 given term 𝑡𝑖
THE SKIP-GRAM LOSS
S is the set of all windows over the training text
c is the number of neighbours we need to predict on either side of the term 𝑡𝑖
Full softmax is computationally impractical - hierarchical softmax or negative sampling is employed instead
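Written out in its standard form (the softmax referred to above is the one over the vocabulary T):
ℒ_skip-gram = −(1/|S|) Σ_{i∈S} Σ_{−c ≤ j ≤ +c, j ≠ 0} log p(t_{i+j} | t_i)
where p(t_{i+j} | t_i) is a softmax over the vocabulary, computed from the dot product between the IN embedding of t_i and the OUT embedding of t_{i+j}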
CONTINUOUS
BAG-OF-WORDS
(CBOW)
Predict the middle term 𝑡𝑖 given
{𝑡𝑖−𝑐, … , 𝑡𝑖−1, 𝑡𝑖+1, … , 𝑡𝑖+𝑐}
THE CBOW LOSS
Note: from every window of text skip-gram generates 2 x c training samples
whereas CBOW generates one – that’s why CBOW trains faster than skip-gram
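A standard way to write the CBOW loss (the context is typically represented by the sum or average of the IN embeddings of the neighboring terms):
ℒ_CBOW = −(1/|S|) Σ_{i∈S} log p(t_i | t_{i−c}, …, t_{i−1}, t_{i+1}, …, t_{i+c})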
WORD ANALOGIES
WITH WORD2VEC
W2v is popular for word analogy tasks
But remember the same relationships also
exist in the observed feature space, as we
saw earlier
A MATRIX INTERPRETATION OF WORD2VEC
Let x_ij be the frequency of the pair (t_i, t_j) in the training data; the word2vec loss can then be interpreted as a cross-entropy error between the actual and the predicted co-occurrence probabilities.
Replace the cross-entropy error
with a squared-error and apply a
saturation function f(…) over 𝑥𝑖𝑗
GLOVE
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
ℒ_GloVe = Σ_{i=1}^{|T|} Σ_{j=1}^{|T|} f(x_{i,j}) · (log(x_{i,j}) − w_i^⊺·w̄_j)²
where log(x_{i,j}) is the actual co-occurrence, w_i^⊺·w̄_j is the predicted co-occurrence, f is a saturation function over the actual co-occurrence count, and the term inside the sum is a squared error between the two
PARAGRAPH2VEC
W2v style model where context is
document, not neighboring term
Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
RECAP: HOW TO LEARN TERM EMBEDDINGS?
Learn from <term, context, count> data
Choice of context (e.g., neighboring term or container document) defines what
relationship you are modeling
Choice of learning algorithm (e.g., matrix factorization or SGD) defines
how well you model the relationship
Choice of context and learning algorithm are independent – you can use
matrix factorization with neighboring term context, or a w2v-style neural
network with document context (e.g., paragraph2vec)
EXAMPLES OF TEXT EMBEDDINGS
For each model: embedding for | source item | target item | learning model
• Latent Semantic Analysis, Deerwester et al. (1990): single word | word (one-hot) | document (one-hot) | matrix factorization
• Word2vec, Mikolov et al. (2013): single word | word (one-hot) | neighboring word (one-hot) | neural network (shallow)
• GloVe, Pennington et al. (2014): single word | word (one-hot) | neighboring word (one-hot) | matrix factorization
• Semantic Hashing (auto-encoder), Salakhutdinov and Hinton (2007): multi-word text | document (bag-of-words) | same as source (bag-of-words) | neural network (deep)
• DSSM, Huang et al. (2013), Shen et al. (2014): multi-word text | query text (bag-of-trigrams) | document title (bag-of-trigrams) | neural network (deep)
• Session DSSM, Mitra (2015): multi-word text | query text (bag-of-trigrams) | next query in session (bag-of-trigrams) | neural network (deep)
• Language Model DSSM, Mitra and Craswell (2015): multi-word text | query prefix (bag-of-trigrams) | query suffix (bag-of-trigrams) | neural network (deep)
DEEP NEURAL
NETWORKS
45 MINS
(12:15 PM - 13:00 PM)
DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
LET’S TALK (BRIEFLY) ABOUT
SUPERVISION FOR LEARNING
TEXT EMBEDDINGS WITH DNNS
Supervised approach
Ideal if sufficiently labeled training data is available for the target
retrieval task
Unsupervised approach
E.g., training an auto-encoder or a language model on unlabeled
corpus
Hybrid approach
Current state-of-the-art results have employed large-scale
unsupervised pretraining—followed by sufficiently large-scale
supervised fine-tuning towards the target task
SIAMESE NETWORK
Supervised model trained on 𝑞, 𝑑1, 𝑑2 where 𝑑1is relevant to
q, but 𝑑2 is non-relevant
Logistic loss is popularly used—think RankNet where
𝑠𝑖𝑚 𝑣 𝑞, 𝑣 𝑑 is the model score
Typically both left and right models share similar architectures,
but may also choose to share the learnable parameters
AUTOENCODER
Unsupervised models trained to minimize
reconstruction errors
Information Bottleneck method (Tishby et al., 1999)
The bottleneck layer 𝑥 captures “minimal sufficient
statistics” of 𝑣 and is a compressed representation of
the same
LANGUAGE MODELING
A family of language modeling tasks have been
explored in the literature, including:
• Predict next word in a sequence
• Predict masked word in a sequence
• Predict next sentence
Fundamentally the same idea as word2vec and older
neural LMs—but with deeper models and considering
dependencies across longer distances between terms
(Diagram: given the input sequence w1, w2, [MASK], w4, the model predicts the masked term and the loss is computed against w3)
SHIFT-INVARIANT
NEURAL OPERATIONS
Detecting a pattern in one part of the input space is similar to
detecting it in another
Leverage redundancy by moving a window over the whole
input space and then aggregate
On each instance of the window a kernel—also known as a
filter or a cell—is applied
Different aggregation strategies lead to different architectures
CONVOLUTION
Move the window over the input space each time applying the
same cell over the window
A typical cell operation can be,
h = σ(W·X + b)
Full Input [words x in_channels]
Cell Input [window x in_channels]
Cell Output [1 x out_channels]
Full Output [(1 + (words – window) / stride) x out_channels]
POOLING
Move the window over the input space, each time applying an
aggregate function over each dimension within the window
h_j = max_{i∈win}(X_{i,j}) or h_j = avg_{i∈win}(X_{i,j})
Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [(1 + (words – window) / stride) x channels]
max-pooling, average-pooling
CONVOLUTION W/
GLOBAL POOLING
Stacking a global pooling layer on top of a convolutional layer
is a common strategy for generating a fixed length embedding
for a variable length text
Full Input [words x in_channels]
Full Output [1 x out_channels]
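A minimal sketch of convolution with global max-pooling over a word-embedding sequence; the dimensions are arbitrary choices for illustration:

import torch
import torch.nn as nn

words, in_channels, out_channels, window = 12, 300, 128, 3
x = torch.randn(1, in_channels, words)             # [batch, in_channels, words]
conv = nn.Conv1d(in_channels, out_channels, kernel_size=window)
h = torch.relu(conv(x))                            # [1, out_channels, words - window + 1]
text_embedding = h.max(dim=2).values               # global max-pooling -> [1, out_channels]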
RECURRENCE
Similar to a convolution layer but additional dependency on
previous hidden state
A simple cell operation shown below but others like LSTM and
GRUs are more popular in practice,
h_i = σ(W·X_i + U·h_{i−1} + b)
Full Input [words x in_channels]
Cell Input [window x in_channels] + [1 x out_channels]
Cell Output [1 x out_channels]
Full Output [1 x out_channels]
RECURSIVE OR TREE-
RNN
Shared weights among all the levels of the tree
Cell can be an LSTM or as simple as
h = σ(W·X + b)
Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [1 x channels]
ATTENTION
Given a set of n items and an input context, produce a
probability distribution {a1, …, ai, …, an} of attending to each item
as a function of similarity between a learned representation (q)
of the context and learned representations (ki) of the items
a_i = φ(q, k_i) / Σ_{j=1}^{n} φ(q, k_j)
The aggregated output is given by Σ_{i=1}^{n} a_i · v_i
Full Input [words x in_channels], [1 x ctx_channels]
Full Output [1 x out_channels]
* When attending over a sequence (and not a set), the key k and value
v are typically a function of the item and some encoding of the position
SELF ATTENTION
Given a sequence (or set) of n items, treat each item as the
context at a time and attend over the whole sequence (or set),
and repeat for all n items
Full Input [words x in_channels]
Full Output [words x out_channels]
RESIDUAL NETWORKS
Enabled training of really
deep architectures (up to
152 layers)
Each layer learns the
residual functions with
reference to the layer inputs
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
TRANSFORMERS
A transformer layer consists of a combination of a self-attention
layer and multiple fully-connected or convolutional layers, with
residual connections
A transformer-based encoder can consist of multiple
transformers stacked in sequence
Full Input [words x in_channels]
Full Output [words x out_channels]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
NORMALIZATION
Internal covariate shift refers to the
changing distribution of each layer’s
inputs during training, as the parameters
of the previous layers change
BatchNorm and other normalization
techniques achieve better training
effectiveness by addressing this problem
Sergey Ioffe, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
Image source: https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/
REGULARIZATION
The process of adding information
in order to prevent overfitting.
Popular approaches:
• Dropout
• Regularization loss
• Early stopping
CONTEXTUALIZED
DEEP WORD
EMBEDDINGS
http://jalammar.github.io/illustrated-bert/
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
BERT
Stacked transformer layers
Pretrained on two tasks:
• Masked language modeling
• Next sentence prediction
Input: WordPiece embedding +
position embedding + segment
embedding
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
LUNCH
60 MINS
(13:00 PM - 14:00 PM)
SHALLOW NEURAL
METHODS FOR RANKING
50 MINS
(14:00 PM - 14:50 PM)
THERE IS A LONG HISTORY OF
VECTOR SPACE MODELS (BOTH
DENSE AND SPARSE) IN IR
Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. In CACM, 1975.
Scott Deerwester, et. al. Indexing by latent semantic analysis. In JASIST, 1990.
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
RETRIEVAL USING VECTOR REPRESENTATIONS
Generate vector
representation of query
Generate vector
representation of document
Estimate relevance from q-d
vectors
Compare query and document
directly in the embedding space
POPULAR APPROACHES TO INCORPORATING
TERM EMBEDDINGS FOR MATCHING
Use embeddings to generate
suitable query expansions
estimate relevance estimate relevance
E.g.,
Generalized Language Model [Ganguly et
al., 2015]
Neural Translation Language Model
[Zuccon et al., 2015]
Average term embeddings [Le and Mikolov,
2014, Nalisnick et al., 2016, Zamani and Croft, 2016,
and others]
Word mover’s distance [Kusner et al., 2015,
Guo et al., 2016]
Compare query and document
directly in the embedding space
estimate relevance
GENERALIZED LANGUAGE MODEL
Traditional language modeling based IR approach may estimate q-d relevance as follows,
where, 𝑝 𝑡 𝑞|𝑑 is the
probability of generating
term 𝑡 𝑞 from document 𝑑
p(t_q | d) and p(t_q | D) are the probabilities of randomly sampling term t_q from document d and the full collection D, respectively
p(t_q | D) has a smoothing effect on the p(t_q | d) estimation
GENERALIZED LANGUAGE MODEL
GLM includes additional smoothing based on term similarity in the embedding space
Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
Probability of generating the
term from the document
based on similarity in the
embedding space
Probability of generating the term
from the full collection based on
similarity in the embedding space
NEURAL TRANSLATION LANGUAGE MODEL
Translation Language Model:
Neural Translation Language Model:
TLM estimates p(t_q | t_d) from q-d paired data,
similar to statistical machine translation
NTLM uses term-term similarity in the
embedding space to estimate p(t_q | t_d)
Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
AVERAGE TERM EMBEDDINGS
Q-D relevance
estimated by
computing cosine
similarity between
centroid of q and d
term embeddings
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
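A minimal sketch of this estimate; the random matrices stand in for learned term embeddings:

import torch
import torch.nn.functional as F

q_emb = torch.randn(3, 200)      # embeddings of the 3 query terms
d_emb = torch.randn(50, 200)     # embeddings of the 50 document terms
# cosine similarity between the centroids of the query and document term embeddings
score = F.cosine_similarity(q_emb.mean(dim=0), d_emb.mean(dim=0), dim=0)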
WORD MOVER’S DISTANCE
Based on the Earth Mover’s Distance (EMD)
[Rubner et al., 1998]
Originally proposed by Wan et al. [2005, 2007],
but used WordNet and topic categories
Kusner et al. [2015] incorporated term
embeddings
Adapted for q-d matching by Guo et al. [2016]
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In CV, 1998.
Xiaojun Wan and Yuxin Peng. The earth mover’s distance as a semantic measure for document similarity. In CIKM, 2005.
Xiaojun Wan. A novel document similarity measure based on earth mover’s distance. Information Sciences, 2007.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, 2016.
CHOICE OF TERM EMBEDDINGS
FOR DOCUMENT RANKING
RECAP: for the query “Albuquerque” the relevant
document may contain terms like “population” and “area”
Documents about “Santa Fe” not relevant for this query
“Albuquerque” ↔ “population” (Topically similar) ✓
“Albuquerque” ↔ “Santa Fe” (Typically similar) ✗
Standard LSA and para2vec capture topical similarity,
whereas w2v and GloVe capture a mix of both Top/Typ-ical
Passage about Albuquerque
Passage not about Albuquerque
Query: “Albuquerque”
DUAL EMBEDDING SPACE MODEL
What if I told you that everyone
using word2vec is throwing half
the model away?
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
IN-OUT captures a more
Topical notion of similarity
than IN-IN and OUT-OUT
Effect is exaggerated
when embeddings are
trained on short text (e.g.,
queries)
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
Average term embeddings model, but use IN embeddings for
query terms and OUT embeddings for document terms
DUAL EMBEDDING SPACE MODEL
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
IN+OUT Embeddings for 2.7M words
trained on 600M+ Bing queries
https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e
Download
RELEVANCE-BASED
WORD EMBEDDING
Goal: learn a word embedding that directly
models a topical notion of similarity
Given query q, predict term t sampled from
a smoothed language model (estimated
using PRF) for the query
Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In SIGIR, 2017.
A TALE OF TWO QUERIES
“PEKAROVIC LAND COMPANY”
Hard to learn good representation for
the rare term pekarovic
But easy to estimate relevance based on
count of exact term matches of
pekarovic in the document
“WHAT CHANNEL ARE THE
SEAHAWKS ON TODAY”
Target document likely contains ESPN
or sky sports instead of channel
The terms ESPN and channel can be
compared in a term embedding space
Matching in the term space is necessary to handle rare terms. Matching in the
latent embedding space can provide additional evidence of relevance. Best
performance is often achieved by combining matching in both vector spaces.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
Besides the term “Cambridge”, other related terms (e.g., “university”, “town”,
“population”, and “England”) contribute to the relevance of the passage
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
However, the same terms may also make a passage about Oxford look somewhat
relevant to the query “Cambridge”
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
A passage about giraffes, however, obviously looks non-relevant in the embedding
space…
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity)
But the embedding based matching model is more robust to the same passage when "giraffe" is
replaced by "Cambridge"—a trick that would fool exact term based IR models. In a sense, the
embedding based model ranks this passage low because Cambridge is not "an African
even-toed ungulate mammal".
Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
E.g.,
Generalized Language Model [Ganguly et
al., 2015]
Neural Translation Language Model
[Zuccon et al., 2015]
Average term embeddings [Le and Mikolov,
2014, Nalisnick et al., 2016, Zamani and Croft, 2016,
and others]
Word mover’s distance [Kusner et al., 2015,
Guo et al., 2016]
Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016.
Hamed Zamani and W Bruce Croft. Estimating embedding vectors for queries. In ICTIR, 2016.
Compare query and document
directly in the embedding space
estimate relevance
Compare query and document
directly in the embedding space
POPULAR APPROACHES TO INCORPORATING
TERM EMBEDDINGS FOR MATCHING
Use embeddings to generate
suitable query expansions
estimate relevance estimate relevance
QUERY EXPANSION USING
TERM EMBEDDINGS
Use embeddings to generate
suitable query expansions
estimate relevance
Find good expansion terms based on nearness in
the embedding space
Better retrieval performance when combined
with pseudo-relevance feedback (PRF) [Zamani and
Croft, 2016] and if we learn query specific term
embeddings [Diaz et al., 2016]
Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016.
Hamed Zamani and W Bruce Croft. Embedding-based query language models. In ICTIR, 2016.
QUERY EXPANSION
Language model based IR: score(d, q) = KL(θ_q, θ_d), where θ_q is the query language model and θ_d is the document language model
Query expansion: θ̂_q = U·Uᵀ·θ_q
Query expansion using PRF: U is the |v| × |D| term-document matrix
Query expansion using word embeddings: U is the |v| × k term embedding matrix
“Word2vec is the sriracha sauce of deep learning!”
BUT A GOOD
CHEF…
Would prepare the
appropriate sauce for
each dish.
GLOBAL VS. LOCAL
EMBEDDINGS
Nearest neighbors of the word "cut" (as in "gas cut") in the embedding space:
• Local: tax, deficit, vote, budget, reduction, house, bill, plan, spend, billion
• Global: cutting, squeeze, reduce, slash, reduction, spend, lower, halve, soften, freeze
U_global: embedding trained with P(w|C), where C is the whole document corpus
U_local: embedding trained with P(w|R), where R is the set of relevant documents only
QUERY EXPANSION USING
GLOBAL AND LOCAL WORD
EMBEDDINGS
Each point represents a candidate expansion term
Red points have high frequency in the relevant set
of documents
White points have low or no frequency in the
relevant set of documents
The blue point represents the query.
Contours indicate distance from the query
global
local
Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
DEEP NEURAL METHODS
FOR RANKING
90 MINS
(14:50 PM - 16:20 PM)
SEMANTIC
HASHING
Document autoencoder minimizing
reconstruction error
Input: word counts (vocab size = 2K)
Output: binary vector
Stacked RBMs w/ layer-by-layer pre-training
followed by E2E tuning
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
DEEP SEMANTIC
SIMILARITY
MODEL (DSSM)
Siamese network trained E2E on query and
document title pairs
Relevance is estimated by cosine similarity between
query and document embeddings
Input: character trigraph counts (bag of words
assumption)
Minimizes cross-entropy loss against randomly
sampled negative documents
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
CONVOLUTIONAL
DSSM (CDSSM)
Replace bag-of-words assumption by concatenating
term vectors in a sequence on the input
Convolution followed by global max-pooling
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
REMEMBER…
…how different embedding
spaces capture different
notions of similarity?
DSSM TRAINED ON DIFFERENT TYPES OF DATA
Trained on pairs of… (sample training data → useful for → paper):
• Query and document titles: <"things to do in seattle", "seattle tourist attractions"> → Document ranking → (Shen et al., 2014) https://dl.acm.org/citation...
• Query prefix and suffix: <"things to do in", "seattle"> → Query auto-completion → (Mitra and Craswell, 2015) https://dl.acm.org/citation...
• Consecutive queries in user sessions: <"things to do in seattle", "space needle"> → Next query suggestion → (Mitra, 2015) https://dl.acm.org/citation...
Each model captures a different notion of similarity
(or regularity) in the learnt embedding space
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
Nearest neighbors for “seattle” and “taylor swift” based on two DSSM models
– one trained on query-document pairs and the other trained on query
prefix-suffix pairs
DIFFERENT REGULARITIES IN DIFFERENT
EMBEDDING SPACES
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
DIFFERENT REGULARITIES IN DIFFERENT
EMBEDDING SPACES
Groups of similar search intent
transitions from a query log
The DSSM trained on session query pairs
can capture regularities in the query space
(similar to word2vec for terms)
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
DSSM TRAINED ON SESSION QUERY PAIRS
ALLOWS FOR ANALOGIES OVER SHORT TEXT!
Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
INTERACTION-BASED
NETWORKS
Typically a document is relevant if some part of the
document contains information relevant to the query
Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing
the ith window over query terms with the jth window over the
document terms—captures evidence of relevance from
different parts of the document
Additional neural network layers can inspect the interaction
matrix and aggregate the evidence to estimate overall
relevance
Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
REMEMBER…
…the importance of
incorporating exact term
matches, as well as matches
in the latent space, when
estimating relevance?
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Mitra et al. [2016] argue that both lexical and
semantic matching is important for
document ranking
Duet model is a linear combination of two
DNNs—focusing on lexical and semantic
matching, respectively—jointly trained on
labelled data
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Lexical sub-model operates over input matrix 𝑋
x_{i,j} = 1 if t_{q,i} = t_{d,j}, and 0 otherwise
In relevant documents,
1. Many matches, typically in clusters
2. Matches localized early in document
3. Matches for all query terms
4. In-order (phrasal) matches
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Convolve using window of size 𝑛 𝑑 × 1
Each window instance compares a query term w/
whole document
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Semantic sub-model matches in the latent
embedding space
Match query with moving windows over document
Learn text embeddings specifically for the task
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
BIG VS. SMALL DATA
REGIMES
Big data seems to be more crucial for models that focus on
good representation learning for text
Partial supervision strategies (e.g., unsupervised pre-training of
word embeddings) can be effective but may be leaving the
bigger gains on the table
Learning to train on unlabeled data
may be key to making progress on
neural ad-hoc retrieval
Which IR models are similar?
Clustering based on query level
retrieval performance.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
Duet implementation on PyTorch
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb
GET THE CODE
WIDE AND DEEP
MODEL
Deep model for representation
learning and wide model for
memorization
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
KERNEL POOLING
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017.
Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
WEB DOCUMENTS ARE MORE THAN
JUST BODY TEXT…
URL
incoming
anchor text
title
body
clicked query
EXTENDING NEURAL RANKING MODELS TO
MULTIPLE DOCUMENT FIELDS
BM25 → BM25F
Neural ranking model → ?
RANKING DOCUMENTS
WITH MULTIPLE FIELDS
Learn different embedding space for each
document field
Different fields may match different aspects of
the query—learn different query embeddings for
matching against different fields
Represent per field match by a vector, not a
score
Field level dropout during training can regularize
against over-dependency on any individual field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Learn a different embedding space
for each document field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
For multiple-instance fields,
average pool the instance level
embeddings
Mask empty text instances, and
average only among non-empty
instances to avoid preferring
documents with more instances
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
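A minimal sketch of the masked average pooling described above; the tensor shapes and example mask are assumptions for illustration.

```python
# Masked average pooling over the instances of a multiple-instance field.
import torch

def masked_average(instance_embs, mask):
    # instance_embs: [num_instances, dim]; mask: [num_instances], 1 for
    # non-empty instances. Averaging only over non-empty instances avoids
    # rewarding documents simply for having more instances of a field.
    mask = mask.unsqueeze(-1).float()           # [num_instances, 1]
    total = (instance_embs * mask).sum(dim=0)   # sum of non-empty rows
    count = mask.sum().clamp(min=1.0)           # avoid division by zero
    return total / count

embs = torch.randn(5, 64)               # e.g., 5 anchor-text instances
mask = torch.tensor([1, 1, 1, 0, 0])    # last two slots are empty/padded
field_emb = masked_average(embs, mask)  # shape: [64]
```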
Learn different query
embeddings for matching
against different fields
Different fields may match
different aspects of the query
Ideal query representation for
matching against URL likely to be
different from for matching with
title
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Represent per field match by a
vector, not a score
Allows the model to validate
that across the different fields all
aspects of the query intent have
been covered
(Similar intuition as BM25F)
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Aggregate evidence of relevance
across all document fields
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
High precision fields, such as
clicked queries, can negatively
impact the modeling of the
other fields
Field level dropout during
training can regularize against
over-dependency on any
individual field
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
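One possible way to implement field-level dropout is sketched below; the field names, drop probability, and dictionary-of-embeddings representation are assumptions, not the paper's implementation.

```python
# Field-level dropout: randomly zero out entire field representations
# during training to discourage over-reliance on any single field.
import torch

def field_dropout(field_embs, p=0.2, training=True):
    # field_embs: dict mapping field name -> [dim] embedding.
    if not training:
        return field_embs
    return {name: emb * (torch.rand(()) >= p).float()
            for name, emb in field_embs.items()}

fields = {"title": torch.randn(64), "url": torch.randn(64),
          "body": torch.randn(64), "clicked_queries": torch.randn(64)}
fields = field_dropout(fields, p=0.2)
```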
MANY OTHER NEURAL ARCHITECTURES
(Palangi et al., 2015)
(Kalchbrenner et al., 2014)
(Denil et al., 2014)
(Kim, 2014)
(Severyn and Moschitti, 2015)
(Zhao et al., 2015) (Hu et al., 2014)
(Tai et al., 2015)
(Guo et al., 2016)
(Hui et al., 2017)
(Pang et al., 2017)
(Jaech et al., 2017)
(Dehghani et al., 2017)
BERT FOR RANKING
BERT and other large-scale unsupervised language models are
demonstrating dramatic performance improvements on many IR tasks
Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019.
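A minimal re-ranking sketch using the Hugging Face transformers library is shown below. The checkpoint name is a placeholder and its classification head is not fine-tuned here; Nogueira and Cho fine-tune BERT on MS MARCO relevance labels, which is not reproduced in this sketch.

```python
# Cross-encoder style scoring of a query-passage pair with BERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def score(query, passage):
    # Feed the concatenated query-passage pair to BERT and read the
    # relevance score off the classification head.
    inputs = tokenizer(query, passage, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

candidates = ["BERT is a pre-trained language model.",
              "Kolkata is a city in India."]
reranked = sorted(candidates, key=lambda p: score("what is bert", p), reverse=True)
```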
(Figure: BERT takes a query-passage pair as input and outputs a relevance score; results reported on MS MARCO)
Impact across both academia and industry
BERT FOR RANKING
WHAT DID YOUR MODEL
REALLY LEARN?
While we celebrate the recent performance bumps on
IR tasks from neural methods, it is also important to
recognize when and how they fail
Clever Hans was a horse claimed to have been
capable of performing arithmetic and other
intellectual tasks.
"If the eighth day of the month comes on a
Tuesday, what is the date of the following Friday?“
Hans would answer by tapping his hoof.
In fact, the horse was purported to have been
responding directly to involuntary cues in the
body language of the human trainer, who had the
faculties to solve each problem. The trainer was
entirely unaware that he was providing such cues.
(source: Wikipedia)
BM25 depends on the inverse document frequency of terms, whereas BERT depends on a language model of term co-occurrences.
What corpus statistics does your model depend on?
WHAT CHANGED
BETWEEN TRAIN
AND TEST?
Terms often change meaning
across domains or over time
Robust retrieval performance is
important (e.g., enterprise search
across multiple tenants)
Example: the query "uk prime minister" refers to different people in older (1990s) TREC data, in more recent data, and today.
(Figure: model trained on domains A, B, and C, then tested on an unseen domain X)
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Baseline model projects
query and document to
latent space for matching
Additional fully-connected
layers to estimate relevance
Hidden layers may encode
domain specific statistics
(Architecture: the query and the document each pass through convolution and pooling layers; the outputs are combined with a Hadamard product and fed to dense layers that produce the relevance score y)
How do we encourage the model to only learn
features that generalize across multiple domains?
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
Train model on multiple domains
During training, an adversarial
discriminator inspects the hidden
states of the model and tries to
predict the source corpus of the
training sample
(Architecture: the same query and document convolution and pooling layers, Hadamard product, and dense layers producing y, with an additional dense adversarial discriminator z attached to the hidden representation)
The duet model, in addition to optimizing for the
ranking loss, also tries to “fool” the adversarial
discriminator – and in the process learns more
domain independent representations
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
(The loss is a function of the query, a relevant document, and a non-relevant document, with separate parameters for the ranking model and for the adversarial discriminator)
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
ADDITIONAL REGULARIZATION FOR THE
RANKING LOSS
Reverse the gradient from
the discriminator when
back-propagating through
the ranking model
(Architecture: as above, but the gradient from the adversarial discriminator is reversed before it reaches the ranking model)
Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
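Gradient reversal can be implemented as a custom autograd function, as in the generic PyTorch sketch below (not the authors' code); the scaling factor lam is an assumption.

```python
# Gradient reversal layer: identity in the forward pass, negated and
# scaled gradient in the backward pass, so the ranking model is trained
# to increase the discriminator's loss and hide domain-specific features.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# Usage: domain_logits = discriminator(grad_reverse(hidden_state))
hidden = torch.randn(8, 32, requires_grad=True)
grad_reverse(hidden, lam=0.5).sum().backward()  # hidden.grad is -0.5 everywhere
```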
GRADIENT REVERSAL
Adversarial regularization
may also be useful for
mitigating bias
MARRYING THE OLD WITH THE NEW
(SIGIR ’94) (SIGIR ’04) (SIGIR ’05)
source: https://www.eecis.udel.edu/~hfang/AX.html
source: https://www.eecis.udel.edu/~hfang/AX.html
CONNECTION TO NEURAL RANKER TRAINING
Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
AXIOMATIC REGULARIZATION FOR NEURAL
RANKER
Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
BREAK
10 MINS
(4:20 PM - 4:30 PM)
BEYOND RERANKING
45 MINS
(4:30 PM - 5:15 PM)
RETRIEVING, NOT JUST RERANKING, WITH
DEEP NEURAL NETWORKS
Deep ranking models are compute-
intensive and are practically
employed only to rerank top-k
candidates retrieved by more
efficient traditional IR methods
IR performance may improve significantly
more if we can also use these models
for candidate generation
OPTION 1: QUERY INDEPENDENT
DOCUMENT REPRESENTATION
Employ a Siamese network architecture
Compute document representations offline
and query representation at inference time
Efficient online but large offline
computation cost
Effectiveness degrades without interaction
features and lexical term matching
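A minimal sketch of this offline/online split follows; the EmbeddingBag encoders, vocabulary size, and random term ids are stand-ins for a trained Siamese model.

```python
# Offline document encoding and online query-time scoring with a
# query-independent document representation.
import torch
import torch.nn.functional as F

doc_encoder = torch.nn.EmbeddingBag(num_embeddings=30000, embedding_dim=128)
query_encoder = torch.nn.EmbeddingBag(num_embeddings=30000, embedding_dim=128)

# Offline: encode and store every document representation once.
doc_term_ids = [torch.randint(0, 30000, (50,)) for _ in range(1000)]
with torch.no_grad():
    doc_vectors = torch.stack([doc_encoder(t.unsqueeze(0)).squeeze(0)
                               for t in doc_term_ids])
doc_vectors = F.normalize(doc_vectors, dim=-1)

# Online: encode only the query, then score against precomputed vectors.
query_ids = torch.randint(0, 30000, (1, 4))
with torch.no_grad():
    q_vec = F.normalize(query_encoder(query_ids), dim=-1).squeeze(0)
scores = doc_vectors @ q_vec          # cosine similarities
candidates = scores.topk(10).indices  # top-k candidate documents
```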
APPROXIMATE
K-NN SEARCH
Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let's replace term-based retrieval with k-nn search. In CIKM, 2016.
LEARNING SPARSE VECTOR REPRESENTATIONS
Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
Hamed Zamani, et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM, 2018.
FAST APPROX. K-NN SEARCH WITH ANNOY
https://github.com/spotify/annoy
Efficient online but large offline
computation cost
Can scale to tail queries but at
higher computation cost—we can
trade-off the two experimentally
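A small usage sketch with the Annoy library (pip install annoy); random vectors stand in for learned document embeddings, and the number of trees and neighbours are illustrative.

```python
# Build an approximate nearest-neighbour index over document vectors
# and query it with a query vector.
import random
from annoy import AnnoyIndex

dim = 128
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

for doc_id in range(10000):
    index.add_item(doc_id, [random.gauss(0, 1) for _ in range(dim)])

index.build(50)         # more trees: better recall, larger index
index.save("docs.ann")  # the index can be memory-mapped by query servers

query_vec = [random.gauss(0, 1) for _ in range(dim)]
top_docs = index.get_nns_by_vector(query_vec, 10)  # approximate top-10
```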
OPTION 2: ASSUME QUERY TERM INDEPENDENCE
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
WE TYPICALLY LEARN THE PARAMETERS OF THE
MODEL BY MINIMIZING SOME PAIRWISE LOSS…
e.g.,
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
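The specific loss shown on the slide is not reproduced here; as one common example of a pairwise loss (an illustration, not necessarily the exact objective used in the paper), a hinge loss over a relevant document $d^{+}$ and a non-relevant document $d^{-}$ with margin $\epsilon$ is:

$\mathcal{L}(q, d^{+}, d^{-}) = \max\left(0,\; \epsilon - s(q, d^{+}) + s(q, d^{-})\right)$

where $s(q, d)$ is the score the model assigns to a query-document pair.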
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
TERM-DOCUMENT
IMPACT SCORES
(Matrix of term-document impact scores, with terms t1…t5 as rows and documents d1…d5 as columns)
If the IR model assumes query term independence,
we can precompute all term-document impact scores
The matrix is generally sparse, either by definition or
by enforcing additional sparsity constraints
(e.g., assume only terms that appear in the document
have non-zero impact)
Precomputed scores can be used with inverted index
for fast retrieval from large collections
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
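With query term independence, retrieval-time scoring reduces to summing precomputed per-term impact scores, exactly the access pattern of an inverted index. A minimal sketch follows; the impact scores and document ids are made up for illustration.

```python
# Retrieval by summing precomputed term-document impact scores.
from collections import defaultdict

impact = {
    "prime":    {"d1": 2.1, "d3": 0.7},
    "minister": {"d1": 1.4, "d2": 0.9},
}

def retrieve(query_terms, k=10):
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, s in impact.get(term, {}).items():
            scores[doc_id] += s  # sum of precomputed term-document impacts
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(retrieve(["prime", "minister"]))  # [('d1', 3.5), ('d2', 0.9), ('d3', 0.7)]
```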
NEURAL RANKING MODEL WITH QUERY TERM
INDEPENDENCE ASSUMPTION
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
THE EFFECTIVENESS-EFFICIENCY TRADE-OFF
The model does not have
the context of the full query
which may result in reduced
effectiveness
However, now we can
precompute everything and
use the learned model in a
full retrieval setting!
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
HOW SERIOUS IS THE
LOSS IN EFFECTIVENESS
FROM ASSUMING QUERY
TERM INDEPENDENCE?
Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
Reranking evaluation
Full retrieval evaluation
(on a smaller set of queries than previous table)
DOCUMENT
EXPANSION
Similar to query term
independence approach in that
they are both trying to learn a
better document language model
The training objective here,
however, is to predict relevant
queries and not the target ranking
metric we care about
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document Expansion by Query Prediction. In arXiv, 2019.
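A sketch of how such expansion is applied at indexing time; generate_queries is a hypothetical placeholder for the trained query-prediction (seq2seq) model, and its canned output here is for illustration only.

```python
# Document expansion at indexing time.
def generate_queries(passage, num_queries=5):
    # Placeholder: a real implementation would sample predicted queries
    # from a model trained to map passages to queries (e.g., on MS MARCO).
    return ["what is document expansion"][:num_queries]

def expand_document(passage):
    # Append the predicted queries to the passage before indexing, so that
    # a standard inverted index (e.g., BM25) can match the added terms.
    return passage + " " + " ".join(generate_queries(passage))

expanded = expand_document("Document expansion adds likely query terms to a passage.")
```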
TRADING-OFF SEARCH RESULT QUALITY AND QUERY
RESPONSE TIME IN LARGE SCALE IR SYSTEMS
In Bing, we have a candidate generation stage followed by multiple rank and
prune stages
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
In Bing, the index is distributed over multiple machines
For candidate generation, on each machine the documents are linearly scanned using a match plan
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
When a query comes in, it is automatically
categorized and a pre-defined match plan is
selected
A match plan consists of a sequence of
match rules, and corresponding stopping
criteria
A match rule defines the condition that
a document should satisfy to be selected as
a candidate
The stopping criterion decides when the
index scan using a particular match rule
should terminate—and whether the matching
process should continue with the next match
rule, conclude, or reset to the beginning
of the index
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
Match plans influence the
trade-off between
effectiveness and efficiency
E.g., long queries with rare
intents may require expensive
match plans that consider
body text and search deeper
into the index
In contrast, for popular
navigational queries a shallow
scan against URL and title
metastreams may be sufficient
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
E.g.,
Query: halloween costumes
Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T )
Query: facebook login
Match rule: mrB → (facebook ∈ U|T )
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
During execution, two accumulators are tracked
u: the number of blocks accessed from disk
v: the cumulative number of term matches in all inspected documents
A stopping criterion sets a threshold for each – when either threshold is met, the scan using
that particular match rule terminates
Matching may then continue with a new match rule, terminate, or re-start from the beginning
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
TYPICALLY THESE MATCH PLANS ARE HAND-
CRAFTED AND STATICALLY ASSIGNED TO DIFFERENT
QUERY CATEGORIES
WE CAN CAST THE MATCH PLANNING AS A
REINFORCEMENT LEARNING TASK!
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
REINFORCEMENT
LEARNING
(The standard reinforcement learning loop: the agent takes an action, and the environment returns a reward and a new state)
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
(Here the environment is the index, the action is the choice of match rule, the state is the accumulators (u, v), and the reward is relevance discounted by the number of index blocks accessed)
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Learn a policy πθ : S → A that maximizes the
cumulative discounted reward R = \sum_{t} \gamma^{t} r_{t},
where γ is the discount rate and r_t is the reward at step t
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
We use table-based Q-learning
State space: discrete <ut, vt>
Action space: the available match rules
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Reward function:
g(di) is the relevance of the ith
document estimated based on the
subsequent L1 ranker score—
considering only top n documents
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
(for Bing candidate generation)
Final reward:
If no new documents are selected,
we assign a small negative reward
REINFORCEMENT
LEARNING
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
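A generic tabular Q-learning sketch is shown below to make the update rule concrete; the action set, reward value, state discretization, and hyperparameters are illustrative stand-ins, not the production match-planning system.

```python
# Tabular Q-learning over discretized accumulator states (u, v), with
# actions indexing match rules.
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]                      # e.g., three candidate match rules
Q = defaultdict(lambda: [0.0] * len(ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration

def choose_action(state):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))                   # explore
    return max(range(len(ACTIONS)), key=lambda a: Q[state][a])  # exploit

def q_update(state, action, reward, next_state, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

# One illustrative transition: scanning with match rule 1 moves the
# accumulators from (u=0, v=0) to (u=3, v=12) and yields reward 0.4.
q_update((0, 0), 1, 0.4, (3, 12), done=False)
```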
RESULTS
Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
DEEP LEARNING @ TREC
15 MINS
(5:15 PM - 5:30 PM)
GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA
Past (proprietary data): 200K queries, human-labeled, proprietary
Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017
Past (weak supervision): 1+M queries, weak supervision, open
Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017
Here (two new datasets): 300+K queries, human-labeled, open
TREC 2019 Deep Learning Track
(Figure axes: more data, better search results)
GENERATING PUBLIC BENCHMARKS FOR NEURAL IR
RESEARCH
A public retrieval and ranking benchmark
with large scale training data (~400K
queries with manual relevance labels)
DERIVING OUR TREC 2019 DATASETS
MS MARCO QnA
Leaderboard
• 1M real queries
• 10 passages per Q
• Human annotation
says ~1 of 10
answers the query
MS MARCO Passage
Retrieval Leaderboard
• Corpus: Union of
10-passage sets
• Labels: From the
~1 positive passage
TREC 2019 Task:
Passage Retrieval
• Same corpus,
training Q+labels
• New reusable NIST
test set
TREC 2019 Task:
Document Retrieval
• Corpus:
Documents (crawl
passage urls)
• Labels: Transfer
from passage to
doc
• New reusable NIST
test set
http://msmarco.org
https://microsoft.github.io/TREC-2019-Deep-Learning/
SETUP OF THE 2019 DEEP LEARNING TRACK
• Key question: What works best in a large-data regime?
• “nnlm”: Runs that use a BERT-style language model
• “nn”: Runs that do representation learning
• “trad”: Runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
• “fullrank”: End-to-end retrieval
• “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
Task                  | Training data               | Test data                  | Corpus
1) Document retrieval | 367K queries w/ doc labels  | 43* queries w/ doc labels  | 3.2M documents
2) Passage retrieval  | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
DATASET AVAILABILITY
• Corpus + train + dev data for both tasks
available now from the DL Track site*
• NIST test sets available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
Download the slides:
http://bit.ly/dl4search-fire2019
Download the free book:
http://bit.ly/neuralir-intro
Download TREC Deep Learning Track data:
https://microsoft.github.io/TREC-2019-Deep-Learning/
@UnderdogGeek bmitra@microsoft.com
THANK YOU
Deep Learning for Search

  • 1. Deep Learning for Search (Alternative title: “Neural Information Retrieval”) Bhaskar Mitra Principal Applied Scientist, Microsoft PhD Student, University College London @UnderdogGeek bmitra@microsoft.com Background image modified from source: https://commons.wikimedia.org/wiki/File:Howrah_Bridge_from_the_western_bank_of_the_Ganges.jpg
  • 2. I am an applied researcher at Bing. Based in Microsoft Research Montreal. Previously worked for Microsoft in Hyderabad, Seattle, and Cambridge. Part-time PhD candidate at University College London. My research is on neural methods for information retrieval. Originally born and grew up in Kolkata. Bing UCL MSR Cambridge MSR Montreal MSR Montreal MSR Cambridge UCL works for used to be based in based in doing PhD at
  • 3. Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks.
  • 4. THE STATE OF NEURAL INFORMATION RETRIEVAL GROWING PUBLICATION POPULARITY AT TOP IR CONFERENCES STRONG PERFORMANCE AGAINST TRADITIONAL METHODS IN TREC 2019
  • 5. Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com RESOURCES
  • 6. AGENDA Let’s focus on the fundamentals! Please feel free to interrupt and ask lots of questions!
  • 7. THE SEARCH TASK 10 MINS (10:05 AM - 10:15 AM)
  • 8. INFORMATION RETRIEVAL (IR) User has an information need There exists a collection of information resources IR is the activity of retrieving the information resources relevant to the information need
  • 9. EXAMPLE OF AN IR TASK (WEB SEARCH) User expresses information need as a short textual query The search engine retrieves top relevant web documents as information resources We will use web search as the main example of an IR task in the rest of this lecture query Information need retrieval system indexes a document corpus results ranking (document list) Relevance (documents satisfy information need)
  • 10. DESIDERATA Decades of IR research has identified some key factors that text retrieval models should consider Traditional IR models typically incorporate one of more of these Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness
  • 11. DESIDERATA A document that contains more occurrences of the query term(s) is more likely to be relevant Tip: consider term frequency (TF) Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 12. DESIDERATA A rare term (e.g., “msmarco”) is likely to be more informative than a common term (e.g., “and”) Tip: consider inverse document frequency (IDF) Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness more informative than
  • 13. DESIDERATA A term should not contribute disproportionately Increase in TF should have larger impact for smaller TFs Tip: put a saturation function over the TF Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 14. DESIDERATA A document containing more non-relevant terms is likely to be less relevant Tip: perform document length normalization Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 15. DESIDERATA A document containing query terms in close proximity is likely to be more relevant than one where the terms occur far away from each other Tip: consider proximity features Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 16. DESIDERATA Term matches earlier in the document may indicate more likelihood of the document being relevant Tip: consider position of term matches Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness preferable over
  • 17. DESIDERATA Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness uk prime minister The query and the document may refer to the same concept using different vocabularies Tip: consider expanding the query or document, or matching query and document terms in a latent space theresa may
  • 18. DESIDERATA Term frequency Term weighting Term saturation Document length Term proximity Term position Vocabulary mismatch Term aboutness albuquerque By inspecting other terms in the document we may infer if the document is about the query term Tip: consider expanding the query or matching the query terms with the document terms in a latent space Passage about Albuquerque Passage not about Albuquerque
  • 19. EXAMPLES OF RANKING METRICS Discounted Cumulative Gain (DCG) 𝐷𝐶𝐺@𝑘 = 𝑖=1 𝑘 2 𝑟𝑒𝑙𝑖 − 1 𝑙𝑜𝑔2 𝑖 + 1 Reciprocal Rank (RR) 𝑅𝑅@𝑘 = max 1<𝑖<𝑘 𝑟𝑒𝑙𝑖 𝑖
  • 20. FUNDAMENTALS OF NEURAL NETWORKS 30 MINS (10:15 AM - 10:45 AM)
  • 21. NEURAL NETWORKS Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ) Popular choices for σ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backwardpass Expected output loss Tanh ReLU
  • 23. SQUARED LOSS The squared loss is a popular loss function for regression tasks
  • 24. THE SOFTMAX FUNCTION In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 25. CROSS ENTROPY The cross entropy between two probability distributions 𝑝 and 𝑞 over a discrete set of events is given by, If 𝑝 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 = 1and 𝑝𝑖 = 0 for all other values of 𝑖 then,
  • 26. CROSS ENTROPY WITH SOFTMAX LOSS Cross entropy with softmax is a popular loss function for classification
  • 27. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕𝑙 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 STOCHASTIC GRADIENT DESCENT (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 28. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕 𝑦 − 𝑦2 2 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 29. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 30. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 31. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 32. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 𝜕𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 33. Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤2. 𝑥 + 𝑏2 × 𝑤2 × 1 − 𝑡𝑎𝑛ℎ2 𝑤1. 𝑥 + 𝑏1 × 𝑥 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat STOCHASTIC GRADIENT DESCENT (SGD)
  • 34. COMPUTATION NETWORKS The “Lego” approach to specifying DNN architectures Library of computation nodes, each node defines logic for: 1. Forward pass: compute output given input 2. Backward pass: compute gradient of loss w.r.t. inputs, given gradient of loss w.r.t. outputs 3. Parameter gradient: compute gradient of loss w.r.t. parameters, given gradient of loss w.r.t. outputs Chain nodes to create bigger and more complex networks
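As an illustration of the "Lego" idea (not code from the deck), a computation node can be a small class exposing the three pieces of logic; the class and method names here are hypothetical.

```python
import numpy as np

class Tanh:
    """Elementwise tanh node: no parameters, so only forward and backward."""
    def forward(self, x):
        self.out = np.tanh(x)
        return self.out
    def backward(self, grad_out):
        # gradient of loss w.r.t. input, given gradient of loss w.r.t. output
        return grad_out * (1.0 - self.out ** 2)

class Linear:
    """Affine node y = xW + b, with parameter gradients."""
    def __init__(self, d_in, d_out):
        self.W = np.random.randn(d_in, d_out) * 0.1
        self.b = np.zeros(d_out)
    def forward(self, x):
        self.x = x
        return x @ self.W + self.b
    def backward(self, grad_out):
        # parameter gradients, given gradient of loss w.r.t. outputs
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # gradient of loss w.r.t. inputs, to pass to the previous node
        return grad_out @ self.W.T

# chaining nodes creates a bigger, more complex network
net = [Linear(4, 8), Tanh(), Linear(8, 1)]
```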
  • 35. TOOLKITS A diverse set of options to choose from! Figure from https://towardsdatascience.com/battle-of- the-deep-learning-frameworks-part-i-cff0e3841750
  • 36. TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH First, we define the model architecture Next, we specify loss function and optimization algorithm Finally, loop over training data to optimize model parameters https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
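The referenced tutorial follows the three-step pattern above; the sketch below is a condensed PyTorch version (model size, data loading, and hyperparameters are simplified placeholders, not the tutorial's exact code).

```python
import torch
import torch.nn as nn
import torch.optim as optim

# 1. define the model architecture (tiny convnet for 32x32 RGB images)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),   # 10 classes, e.g. CIFAR-10
)

# 2. specify the loss function and the optimization algorithm
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 3. loop over the training data to optimize the model parameters
def train(loader, epochs=2):
    for _ in range(epochs):
        for images, labels in loader:     # loader yields mini-batches
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```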
  • 37. REALLY DEEP NEURAL NETWORKS (Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
  • 38. WHY ADDING DEPTH HELPS http://playground.tensorflow.org
  • 39. VISUAL MOTIVATION FOR HIDDEN UNITS Consider the following “toy” challenge for classifying tech queries: Vocab: {surface, kerberos, book, library}; Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗. Input features → label: (1, 0, 1, 0) ✓; (1, 1, 0, 0) ✗; (0, 1, 0, 1) ✓; (0, 0, 1, 1) ✗ — can’t separate using a linear model! But let’s consider a tiny neural network with one hidden layer (units H1 and H2, with hand-set edge weights) [diagram]…
  • 40. VISUAL MOTIVATION FOR HIDDEN UNITS Or more succinctly, with the same tiny one-hidden-layer network: input features (surface, kerberos, book, library) → hidden layer (H1, H2) → label: (1, 0, 1, 0) → (1, 0) ✓; (1, 1, 0, 0) → (0, 0) ✗; (0, 1, 0, 1) → (0, 1) ✓; (0, 0, 1, 1) → (0, 0) ✗ — can separate using a linear model!
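The hidden units can be verified directly. The numpy sketch below uses one illustrative choice of weights (not necessarily the exact weights on the slide) that reproduces the H1/H2 columns of the table and makes the labels linearly separable.

```python
import numpy as np

# bag-of-words features over the vocab {surface, kerberos, book, library}
queries = {
    "surface book":     np.array([1, 0, 1, 0]),   # label: positive
    "kerberos library": np.array([0, 1, 0, 1]),   # label: positive
    "kerberos surface": np.array([1, 1, 0, 0]),   # label: negative
    "library book":     np.array([0, 0, 1, 1]),   # label: negative
}

# illustrative hidden-unit weights: H1 fires only for surface AND book,
# H2 fires only for kerberos AND library
W = np.array([[ 0.5, -1.0],    # surface
              [-1.0,  0.5],    # kerberos
              [ 0.5, -1.0],    # book
              [-1.0,  0.5]])   # library
b = np.array([-0.5, -0.5])

for q, x in queries.items():
    h = (x @ W + b > 0).astype(int)      # hidden layer (H1, H2)
    label = int(h.sum() > 0)             # a linear threshold now separates the classes
    print(f"{q:18s} H1,H2={h} -> {'positive' if label else 'negative'}")
```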
  • 41. WHY ADDING DEPTH HELPS Deeper networks can split the input space into many more (non-independent) linear regions than shallow networks. Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
  • 42. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019. THE LOTTERY TICKET HYPOTHESIS
  • 43. BIAS-VARIANCE TRADE-OFF IN THE DEEP LEARNING ERA Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
  • 44. LEARNING TO RANK 35 MINS (10:45 AM - 11:20 AM)
  • 45. MOST IR SYSTEMS PRESENT RANKED LISTS OF RETRIEVED INFORMATION ARTIFACTS
  • 46. THE UNREASONABLE EFFECTIVENESS OF SIMPLE LTR BASED APPROACHES
  • 47. LEARNING TO RANK (LTR) ”... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 48. LEARNING TO RANK (LTR) L2R models represent a rankable item—e.g., a document—given some context—e.g., a user-issued query—as a numerical vector 𝑥 ∈ ℝ 𝑛 The ranking model 𝑓: 𝑥 → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher. Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 49. WHY IS RANKING CHALLENGING? Rank based metrics, such as DCG or MRR, are non-smooth / non-differentiable
  • 50. APPROACHES Pointwise approach Relevance label 𝑦 𝑞,𝑑 is a number—derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict 𝑦 𝑞,𝑑 given 𝑥 𝑞,𝑑. Pairwise approach Pairwise preference between documents for a query (𝑑𝑖 ≻ 𝑑𝑗 w.r.t. 𝑞) as label. Reduces to binary classification to predict more relevant document. Listwise approach Directly optimize for rank-based metric, such as NDCG—difficult because these metrics are often not differentiable w.r.t. model parameters. Liu [2009] categorizes different LTR approaches based on training objectives: Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
  • 51. FEATURES They can often be categorized as: Query-independent or static features e.g., incoming link count and document length Query-dependent or dynamic features e.g., BM25 Query-level features e.g., query length Traditional L2R models employ hand-crafted features that encode IR insights
  • 52. FEATURES Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 53. POINTWISE OBJECTIVES Regression loss: given (q, d), predict the value of y_q,d — e.g., square loss for binary or categorical labels, where y_q,d is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 54. POINTWISE OBJECTIVES Classification loss: given (q, d), predict the class y_q,d — e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008], where s(y_q,d) is the model’s score for label y_q,d. Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
  • 55. PAIRWISE OBJECTIVES Given (q, d_i, d_j), predict the more relevant document. For (q, d_i) and (q, d_j): feature vectors x_i and x_j, model scores s_i = f(x_i) and s_j = f(x_j). Pairwise loss minimizes the average number of inversions in the ranking—i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Pairwise loss generally has the form φ(s_i − s_j) [Chen et al., 2009], where φ can be: • Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000] • Exponential function φ(z) = e^(−z) [Freund et al., 2003] • Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005] • Others… Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
  • 56. PAIRWISE OBJECTIVES RankNet loss: pairwise loss function proposed by Burges et al. [2005]—an industry favourite [Burges, 2015]. Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ·(s_i − s_j))). Desired probabilities: p̄_ij = 1 and p̄_ji = 0. Computing the cross-entropy between p̄ and p: L_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ·(s_i − s_j))). Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
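A minimal PyTorch sketch of the RankNet loss (not the original implementation): the scorer f can be any model producing a real-valued score, and the loss is computed stably via softplus, since log(1 + e^(−γ(s_i − s_j))) = softplus(−γ(s_i − s_j)).

```python
import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    """RankNet loss for pairs where document i is preferred over document j:
    log(1 + exp(-gamma * (s_i - s_j)))."""
    return F.softplus(-gamma * (s_i - s_j)).mean()

# usage: s_i, s_j are model scores f(x_i), f(x_j) for preferred / non-preferred docs
s_i = torch.tensor([2.1, 0.3], requires_grad=True)
s_j = torch.tensor([1.0, 0.9], requires_grad=True)
loss = ranknet_loss(s_i, s_j)
loss.backward()   # gradients flow back into the model that produced the scores
```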
  • 57. A GENERALIZED CROSS-ENTROPY LOSS An alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D. Predicted probability: p(d⁺|q) = e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d)). The cross-entropy loss is then given by L_CE(q, d⁺, D) = −log p(d⁺|q) = −log( e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d)) ). Computing the softmax over the full collection is prohibitively expensive—LTR models typically consider a few sampled negative candidates instead [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
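A sketch of the sampled variant of this loss (a standard approximation, not code from the cited papers): the softmax is computed over one relevant document and a handful of sampled negatives rather than the full collection.

```python
import torch
import torch.nn.functional as F

def sampled_ce_loss(pos_score, neg_scores, gamma=1.0):
    """pos_score: [batch], neg_scores: [batch, num_neg].
    Cross-entropy of the relevant document against sampled negatives."""
    logits = gamma * torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)
    target = torch.zeros(logits.size(0), dtype=torch.long)  # index 0 = relevant doc
    return F.cross_entropy(logits, target)

# e.g., a batch of 2 queries, each with 4 sampled negative documents
loss = sampled_ce_loss(torch.randn(2), torch.randn(2, 4))
```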
  • 58. Blue: relevant Gray: non-relevant NDCG and ERR higher for left but pairwise errors less for right Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks But listwise metrics are non-continuous and non-differentiable LISTWISE OBJECTIVES Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010. [Burges, 2010]
  • 59. LISTWISE OBJECTIVES Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t model scores) 2. It is desired that the gradient be bigger for pairs of documents that produces a bigger impact in NDCG by swapping positions Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006. LambdaRank loss Multiply actual gradients with the change in NDCG by swapping the rank positions of the two documents
  • 60. LISTWISE OBJECTIVES According to the Luce model [Luce, 2005], given four items 𝑑1, 𝑑2, 𝑑3, 𝑑4 the probability of observing a particular rank-order, say 𝑑2, 𝑑1, 𝑑4, 𝑑3 , is given by: where, 𝜋 is a particular permutation and 𝜙 is a transformation (e.g., linear, exponential, or sigmoid) over the score 𝑠𝑖 corresponding to item 𝑑𝑖 R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008. ListNet loss Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model score and ground- truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly, computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible.
  • 61. LISTWISE OBJECTIVES Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009. Smooth DCG Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss
  • 62. BREAK 10 MINS (11:20 AM - 11:30 AM)
  • 64. TYPES OF VECTOR REPRESENTATIONS Local (or one-hot) representation Every term in vocabulary T is represented by a binary vector of length |T|, where one position in the vector is set to one and the rest to zero Distributed representation Every term in vocabulary T is represented by a real-valued vector of length k. The vector can be sparse or dense. The vector dimensions may be observed (e.g., hand-crafted features) or latent (e.g., embedding dimensions).
  • 65. Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
  • 66. OBSERVED (OR EXPLICIT) DISTRIBUTED REPRESENTATIONS The choice of features is a key consideration The distributional hypothesis states that terms that are used (or occur) in similar context tend to be semantically similar [Harris, 1954] Firth [1957] famously purported this idea of distributional semantics by stating “a word is characterized by the company it keeps”. Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954. Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford. Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
  • 67. MINOR NOTE: SPOT THE DIFFERENCE! DISTRIBUTED REPRESENTATION Vector representations of items as combinations of different features or dimensions (as opposed to one-hot) DISTRIBUTIONAL SEMANTICS Linguistic items with similar distributions (e.g. context words) have similar meanings http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
  • 68. EXAMPLE: TERM-CONTEXT VECTOR SPACE T: vocabulary, C: set of contexts, S: sparse |T| × |C| matrix whose entry S_ij scores term t_i against context c_j (e.g., PPMI: Positive Pointwise Mutual Information). Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010.
  • 69. EXAMPLE: SALTON’S VECTOR SPACE D: collection, T: vocabulary, S: sparse |D| × |T| matrix whose entry S_ij is the (IDF-weighted) weight of term t_j in document d_i. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
  • 70. NOTIONS OF SIMILARITY Two terms are similar if their feature vectors are close But different feature spaces may capture different notions of similarity Is Seattle more similar to… Sydney (similar type) or Seahawks (similar topic) Depends on your choice of features
  • 71. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
  • 72. NOTIONS OF SIMILARITY Topical or Syntagmatic similarity
  • 73. NOTIONS OF SIMILARITY Typical or Paradigmatic similarity
  • 74. NOTIONS OF SIMILARITY A mix of Topical and Typical similarity
  • 75. NOTIONS OF SIMILARITY Consider the following toy corpus… Now consider the different vector representations of terms you can derive from this corpus and how the items that are similar differ in these vector spaces
  • 76. RETRIEVAL USING VECTOR REPRESENTATIONS Map both query and candidate documents into the same vector space and retrieve the documents closest to the query—e.g., using Salton’s vector space model: sim(q, d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖), where v_q and v_d are vectors of TF-IDF scores over all terms in the vocabulary. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975.
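A minimal numpy sketch of retrieval in this vector space (toy tokenization, toy IDF, and toy documents; not production code).

```python
import numpy as np
from collections import Counter

docs = ["albuquerque is a city in new mexico",
        "the population of the city grew",
        "giraffes are even toed ungulate mammals"]
vocab = sorted(set(" ".join(docs).split()))
# inverse document frequency: log(N / document frequency)
idf = np.log(len(docs) / np.array(
    [sum(t in d.split() for d in docs) for t in vocab]))

def tfidf(text):
    tf = Counter(text.split())
    return np.array([tf[t] for t in vocab]) * idf

def sim(q, d):
    vq, vd = tfidf(q), tfidf(d)
    denom = np.linalg.norm(vq) * np.linalg.norm(vd)
    return vq @ vd / denom if denom > 0 else 0.0

query = "city population"
ranked = sorted(docs, key=lambda d: sim(query, d), reverse=True)
```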
  • 77. REGULARITIES IN OBSERVED FEATURE SPACES Some feature spaces capture interesting linguistic regularities e.g., simple vector algebra in the term-neighboring term space may be useful for word analogy tasks Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
  • 78. EMBEDDINGS An embedding is a representation of items in a new space such that the properties of, and the relationships between, the items are preserved from the original representation. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
  • 79. EMBEDDINGS e.g., 200-dimensional term embedding for “banana”
  • 80. EMBEDDINGS Compared to observed feature spaces: • Embeddings typically have fewer dimensions • The space may have more disentangled principle components • The dimensions may be less interpretable • The latent representations may generalize better
  • 81. What’s the advantage of latent vector spaces over observed features spaces?
  • 82. LET’S TAKE AN IR EXAMPLE In Salton’s vector space, both these passages are equidistant from the query “Albuquerque” A latent feature representation may put the first passage closer to the query because of terms like “population” and “area” Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
  • 83. HOW TO LEARN TERM EMBEDDINGS? Multiple approaches have been proposed for learning embeddings from <term, context, count> data—i.e., from a |T| × |C| matrix X whose entry X_ij counts term t_i in context c_j. Popular approaches include matrix factorization and stochastic gradient descent (SGD).
  • 84. LATENT SEMANTIC ANALYSIS (LSA) Perform SVD on X to obtain its low-rank approximation Involves finding a solution to X = 𝑈Σ𝑉T The embedding for the ith term is given by Σk 𝑡𝑖 Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990.
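A numpy sketch of LSA on a term-document count matrix X (the matrix and the rank k below are random placeholders): the truncated SVD yields low-dimensional term and document representations and a rank-k approximation of X.

```python
import numpy as np

# X: |T| x |D| term-document count (or TF-IDF) matrix -- toy random example
X = np.random.poisson(0.3, size=(1000, 200)).astype(float)
k = 50                                   # target number of latent dimensions

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_embeddings = U_k * s_k              # row i: embedding of the i-th term
doc_embeddings = Vt_k.T * s_k            # row j: embedding of the j-th document
X_approx = U_k @ np.diag(s_k) @ Vt_k     # rank-k approximation of X
```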
  • 85. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 1990. LATENT SEMANTIC ANALYSIS (LSA)
  • 86. WORD2VEC Goal: a simple (shallow) neural model that learns from a billion-word-scale corpus by predicting words from their neighbours (or vice versa) within a fixed-size context window. Two different architectures: 1. Skip-gram 2. CBOW. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • 88. THE SKIP-GRAM LOSS S is the set of all windows over the training text c is the number of neighbours we need to predict on either side of the term 𝑡𝑖 Full softmax is computationally impractical - hierarchical softmax or negative sampling is employed instead
  • 89. CONTINUOUS BAG-OF-WORDS (CBOW) Predict the middle term 𝑡𝑖 given {𝑡𝑖−𝑐, … , 𝑡𝑖−1, 𝑡𝑖+1, … , 𝑡𝑖+𝑐}
  • 90. THE CBOW LOSS Note: from every window of text skip-gram generates 2 x c training samples whereas CBOW generates one – that’s why CBOW trains faster than skip-gram
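The 2c-versus-1 difference is easy to see by enumerating the training samples generated from a single window; the small helper below is purely illustrative (not from the slides).

```python
def window_samples(tokens, i, c=2):
    """Training samples generated from the window centred on tokens[i]."""
    context = [tokens[j] for j in range(max(0, i - c), min(len(tokens), i + c + 1))
               if j != i]
    skipgram = [(tokens[i], ctx) for ctx in context]   # up to 2c (input, target) pairs
    cbow = [(context, tokens[i])]                      # a single (context, target) sample
    return skipgram, cbow

sg, cb = window_samples("the quick brown fox jumps".split(), i=2, c=2)
# sg: [('brown','the'), ('brown','quick'), ('brown','fox'), ('brown','jumps')]
# cb: [(['the','quick','fox','jumps'], 'brown')]
```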
  • 91. WORD ANALOGIES WITH WORD2VEC W2v is popular for word analogy tasks But remember the same relationships also exist in the observed feature space, as we saw earlier
  • 92. A MATRIX INTERPRETATION OF WORD2VEC Let x_ij be the frequency of the pair (t_i, t_j) in the training data, forming a |T| × |T| term-term co-occurrence matrix X. The word2vec objective can then be read as a cross-entropy error between the actual and the predicted co-occurrence probabilities.
  • 93. GLOVE Replace the cross-entropy error with a squared error and apply a saturation function f(·) over x_ij: ℒ_GloVe = Σ_{i=1}^{|T|} Σ_{j=1}^{|T|} f(x_ij) (log x_ij − w_i⊺ w_j)², i.e., a saturation-weighted squared error between the actual (log x_ij) and the predicted (w_i⊺ w_j) co-occurrence statistics. Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • 94. PARAGRAPH2VEC W2v style model where context is document, not neighboring term Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
  • 95. RECAP: HOW TO LEARN TERM EMBEDDINGS? Learn from <term, context, count> data Choice of context (e.g., neighboring term or container document) defines what relationship you are modeling Choice of learning algorithm (e.g., matrix factorization or SGD) defines how well you model the relationship Choice of context and learning algorithm are independent – you can use matrix factorization with neighboring term context, or a w2v-style neural network with document context (e.g., paragraph2vec)
  • 96. EXAMPLES OF TEXT EMBEDDINGS (model | embedding for | source item | target item | learning model)
Latent Semantic Analysis, Deerwester et al. (1990) | Single word | Word (one-hot) | Document (one-hot) | Matrix factorization
Word2vec, Mikolov et al. (2013) | Single word | Word (one-hot) | Neighboring word (one-hot) | Neural network (shallow)
GloVe, Pennington et al. (2014) | Single word | Word (one-hot) | Neighboring word (one-hot) | Matrix factorization
Semantic Hashing (auto-encoder), Salakhutdinov and Hinton (2007) | Multi-word text | Document (bag-of-words) | Same as source (bag-of-words) | Neural network (deep)
DSSM, Huang et al. (2013), Shen et al. (2014) | Multi-word text | Query text (bag-of-trigrams) | Document title (bag-of-trigrams) | Neural network (deep)
Session DSSM, Mitra (2015) | Multi-word text | Query text (bag-of-trigrams) | Next query in session (bag-of-trigrams) | Neural network (deep)
Language Model DSSM, Mitra and Craswell (2015) | Multi-word text | Query prefix (bag-of-trigrams) | Query suffix (bag-of-trigrams) | Neural network (deep)
  • 98. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
  • 102. LET’S TALK (BRIEFLY) ABOUT SUPERVISION FOR LEARNING TEXT EMBEDDINGS WITH DNNS Supervised approach Ideal if sufficiently labeled training data is available for the target retrieval task Unsupervised approach E.g., training an auto-encoder or a language model on unlabeled corpus Hybrid approach Current state-of-the-art results have employed large-scale unsupervised pretraining—followed by sufficiently large-scale supervised fine-tuning towards the target task
  • 103. SIAMESE NETWORK Supervised model trained on triples (q, d1, d2) where d1 is relevant to q but d2 is non-relevant. Logistic loss is popularly used—think RankNet, where sim(v_q, v_d) is the model score. Typically the left and right towers have similar architectures, and may also share the learnable parameters.
  • 104. AUTOENCODER Unsupervised models trained to minimize reconstruction errors Information Bottleneck method (Tishby et al., 1999) The bottleneck layer 𝑥 captures “minimal sufficient statistics” of 𝑣 and is a compressed representation of the same
  • 105. LANGUAGE MODELING A family of language modeling tasks have been explored in the literature, including: • Predict next word in a sequence • Predict masked word in a sequence • Predict next sentence Fundamentally the same idea as word2vec and older neural LMs—but with deeper models and considering dependencies across longer distances between terms w1 [MASK]w2 w4 model ? loss w3
  • 106. SHIFT-INVARIANT NEURAL OPERATIONS Detecting a pattern in one part of the input space is similar to detecting it in another Leverage redundancy by moving a window over the whole input space and then aggregate On each instance of the window a kernel—also known as a filter or a cell—is applied Different aggregation strategies lead to different architectures
  • 107. CONVOLUTION Move the window over the input space, each time applying the same cell over the window. A typical cell operation is h = σ(W·X + b). Full input: [words × in_channels]; cell input: [window × in_channels]; cell output: [1 × out_channels]; full output: [1 + (words − window) / stride × out_channels].
  • 108. POOLING Move the window over the input space, each time applying an aggregate function over each dimension within the window: h_j = max_{i∈win} X_{i,j} (max-pooling) or h_j = avg_{i∈win} X_{i,j} (average-pooling). Full input: [words × channels]; cell input: [window × channels]; cell output: [1 × channels]; full output: [1 + (words − window) / stride × channels].
  • 109. CONVOLUTION W/ GLOBAL POOLING Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed length embedding for a variable length text Full Input [words x in_channels] Full Output [1 x out_channels]
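A PyTorch sketch of this pattern: convolution over a sequence of term embeddings followed by global max-pooling yields a fixed-length embedding for a variable-length text (the dimensions below are arbitrary).

```python
import torch
import torch.nn as nn

embed_dim, out_channels, window = 128, 256, 3

conv = nn.Conv1d(in_channels=embed_dim, out_channels=out_channels,
                 kernel_size=window)

# x: [batch, words, embed_dim] sequence of term embeddings
x = torch.randn(8, 20, embed_dim)
h = conv(x.transpose(1, 2))           # [batch, out_channels, 1 + (words - window)/stride]
h = torch.relu(h)
text_embedding = h.max(dim=2).values  # global max-pooling -> [batch, out_channels]
```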
  • 110. RECURRENCE Similar to a convolution layer but with an additional dependency on the previous hidden state. A simple cell operation is shown below, but cells like LSTMs and GRUs are more popular in practice: h_i = σ(W·X_i + U·h_{i−1} + b). Full input: [words × in_channels]; cell input: [window × in_channels] + [1 × out_channels]; cell output: [1 × out_channels]; full output: [1 × out_channels].
  • 111. RECURSIVE OR TREE- RNN Shared weights among all the levels of the tree Cell can be an LSTM or as simple as ℎ = 𝜎 𝑊𝑋 + 𝑏 Full Input [words x channels] Cell Input [window x channels] Cell Output [1 x channels] Full Output [1 x channels]
  • 112. ATTENTION Given a set of n items and an input context, produce a probability distribution {a1, …, ai, …, an} of attending to each item as a function of similarity between a learned representation (q) of the context and learned representations (k_i) of the items: a_i = φ(q, k_i) / Σ_{j=1}^{n} φ(q, k_j). The aggregated output is given by Σ_{i=1}^{n} a_i · v_i. Full input: [words × in_channels], [1 × ctx_channels]; full output: [1 × out_channels]. * When attending over a sequence (and not a set), the key k and value v are typically a function of the item and some encoding of its position.
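A small sketch of this computation, with φ instantiated as a scaled dot product (one common choice; the slide leaves φ unspecified).

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    """q: [ctx_dim], K: [n, ctx_dim], V: [n, out_dim].
    Returns the attention weights a and the aggregated output sum_i a_i * v_i."""
    scores = K @ q / K.size(-1) ** 0.5   # phi(q, k_i) as a scaled dot product
    a = F.softmax(scores, dim=0)         # probability distribution over the n items
    return a, a @ V

a, output = attend(torch.randn(64), torch.randn(10, 64), torch.randn(10, 32))
```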
  • 113. SELF ATTENTION Given a sequence (or set) of n items, treat each item as the context at a time and attend over the whole sequence (or set), and repeat for all n items Full Input [words x in_channels] Full Output [words x out_channels]
  • 116. RESIDUAL NETWORKS Enabled training of really deep architectures (up to 152 layers) Each layer learns the residual functions with reference to the layer inputs Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
  • 117. TRANSFORMERS A transformer layer consists of a combination of self- attention layer and multiple fully-connected or convolutional layers, with residual connections A transformer-based encoder can consist of multiple transformers stacked in sequence Full Input [words x in_channels] Full Output [words x out_channels] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • 118. NORMALIZATION Internal covariate shift refers to the changing distribution of each layer’s inputs during training, as the parameters of the previous layers change BatchNorm and other normalization techniques achieve better training effectiveness by addressing this problem Sergey Ioffe, and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Image source: https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/
  • 119. REGULARIZATION The process of adding information in order to prevent overfitting. Popular approaches: • Dropout • Regularization loss • Early stopping
  • 120. CONTEXTUALIZED DEEP WORD EMBEDDINGS http://jalammar.github.io/illustrated-bert/ Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
  • 121. BERT Stacked transformer layers Pretrained on two tasks: • Masked language modeling • Next sentence prediction Input: WordPiece embedding + position embedding + segment embedding Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
  • 122. LUNCH 60 MINS (13:00 PM - 14:00 PM)
  • 123. SHALLOW NEURAL METHODS FOR RANKING 50 MINS (14:00 PM - 14:50 PM)
  • 124. THERE IS A LONG HISTORY OF VECTOR SPACE MODELS (BOTH DENSE AND SPARSE) IN IR Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. In CACM, 1975. Scott Deerwester, et. al. Indexing by latent semantic analysis. In JASIST, 1990. Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
  • 125. RETRIEVAL USING VECTOR REPRESENTATIONS Generate vector representation of query Generate vector representation of document Estimate relevance from q-d vectors
  • 126. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
  • 127. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Compare query and document directly in the embedding space estimate relevance
  • 128. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance as follows, where, 𝑝 𝑡 𝑞|𝑑 is the probability of generating term 𝑡 𝑞 from document 𝑑
  • 129. GENERALIZED LANGUAGE MODEL Traditional language modeling based IR approach may estimate q-d relevance as follows, 𝑝 𝑡 𝑞|𝑑 and 𝑝 𝑡 𝑞|𝐷 are the probabilities of randomly sampling term 𝑡 𝑞 from document 𝑑 and the full collection 𝐷, respectively 𝑝 𝑡 𝑞|𝐷 has a smoothing effect on the 𝑝 𝑡 𝑞|𝑑 estimation
  • 130. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015.
  • 133. GENERALIZED LANGUAGE MODEL GLM includes additional smoothing based on term similarity in the embedding space Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Probability of generating the term from the document based on similarity in the embedding space Probability of generating the term from the full collection based on similarity in the embedding space
  • 134. NEURAL TRANSLATION LANGUAGE MODEL Translation Language Model: Neural Translation Language Model: TLM estimates 𝑝 𝑡 𝑞|𝑡 𝑑 from q-d paired data similar to statistical machine translation NTLM uses term-term similarity in the embedding space to estimate 𝑝 𝑡 𝑞|𝑡 𝑑 Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015.
  • 135. AVERAGE TERM EMBEDDINGS Q-D relevance estimated by computing cosine similarity between centroid of q and d term embeddings Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 136. WORD MOVER’S DISTANCE Based on the Earth Mover’s Distance (EMD) [Rubner et al., 1998] Originally proposed by Wan et al. [2005, 2007], but used WordNet and topic categories Kusner et al. [2015] incorporated term embeddings Adapted for q-d matching by Guo et al. [2016] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In CV, 1998. Xiaojun Wan and Yuxin Peng. The earth mover’s distance as a semantic measure for document similarity. In CIKM, 2005. Xiaojun Wan. A novel document similarity measure based on earth mover’s distance. Information Sciences, 2007. Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In CIKM, 2016.
  • 137.
  • 138. CHOICE OF TERM EMBEDDINGS FOR DOCUMENT RANKING RECAP: for the query “Albuquerque” the relevant document may contain terms like “population” and “area” Documents about “Santa Fe” not relevant for this query “Albuquerque” ↔ “population” (Topically similar) ✓ “Albuquerque” ↔ “Santa Fe” (Typically similar) ✗ Standard LSA and para2vec capture topical similarity, whereas w2v and GloVe capture a mix of both Top/Typ-ical Passage about Albuquerque Passage not about Albuquerque Query: “Albuquerque”
  • 139. DUAL EMBEDDING SPACE MODEL What if I told you that everyone using word2vec is throwing half the model away? Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 140. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. IN-OUT captures a more Topical notion of similarity than IN-IN and OUT-OUT Effect is exaggerated when embeddings are trained on short text (e.g., queries)
  • 141. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. Average term embeddings model, but use IN embeddings for query terms and OUT embeddings for document terms
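A numpy sketch of the DESM scoring idea described above: compare the IN embedding of each query term against the centroid of the document terms' OUT embeddings (the embedding matrices here are random placeholders for the trained word2vec IN/OUT matrices).

```python
import numpy as np

vocab = {"albuquerque": 0, "population": 1, "area": 2, "giraffe": 3}
dim = 50
IN = np.random.randn(len(vocab), dim)    # stand-in for the trained word2vec IN embeddings
OUT = np.random.randn(len(vocab), dim)   # stand-in for the trained OUT embeddings

def desm_score(query_terms, doc_terms):
    unit = lambda v: v / np.linalg.norm(v)
    # centroid of the (normalized) OUT embeddings of the document terms
    doc_centroid = unit(np.mean([unit(OUT[vocab[t]]) for t in doc_terms], axis=0))
    # average cosine between each query term (IN space) and the document centroid
    return np.mean([unit(IN[vocab[t]]) @ doc_centroid for t in query_terms])

score = desm_score(["albuquerque"], ["population", "area"])
```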
  • 142. DUAL EMBEDDING SPACE MODEL Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 143. IN+OUT Embeddings for 2.7M words trained on 600M+ Bing queries https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e Download
  • 144. RELEVANCE-BASED WORD EMBEDDING Goal: learn a word embedding that directly models a topical notion of similarity Given query q, predict term t sampled from a smoothed language model (estimated using PRF) for the query Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In SIGIR, 2017.
  • 145. A TALE OF TWO QUERIES “PEKAROVIC LAND COMPANY” Hard to learn good representation for the rare term pekarovic But easy to estimate relevance based on count of exact term matches of pekarovic in the document “WHAT CHANNEL ARE THE SEAHAWKS ON TODAY” Target document likely contains ESPN or sky sports instead of channel The terms ESPN and channel can be compared in a term embedding space Matching in the term space is necessary to handle rare terms. Matching in the latent embedding space can provide additional evidence of relevance. Best performance is often achieved by combining matching in both vector spaces.
  • 146. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) Besides the term “Cambridge”, other related terms (e.g., “university”, “town”, “population”, and “England”) contribute to the relevance of the passage Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 147. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) However, the same terms may also make a passage about Oxford look somewhat relevant to the query “Cambridge” Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 148. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) A passage about giraffes, however, obviously looks non-relevant in the embedding space… Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 149. QUERY: CAMBRIDGE (Font size is a function of term-term cosine similarity) But the embedding based matching model is more robust to the same passage when “giraffe” is replaced by “Cambridge”—a trick that would fool exact term based IR models. In a sense, the embedding based model ranks this passage low because Cambridge is not "an African even- toed ungulate mammal“. Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.
  • 150. E.g., Generalized Language Model [Ganguly et al., 2015] Neural Translation Language Model [Zuccon et al., 2015] Average term embeddings [Le and Mikolov, 2014, Nalisnick et al., 2016, Zamani and Croft, 2016, and others] Word mover’s distance [Kusner et al., 2015, Guo et al., 2016] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth JF Jones. Word embedding based generalized language model for information retrieval. In SIGIR, 2015. Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In ADCS, 2015. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014. Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In WWW, 2016. Hamed Zamani and W Bruce Croft. Estimating embedding vectors for queries. In ICTIR, 2016. Compare query and document directly in the embedding space estimate relevance
  • 151. Compare query and document directly in the embedding space POPULAR APPROACHES TO INCORPORATING TERM EMBEDDINGS FOR MATCHING Use embeddings to generate suitable query expansions estimate relevance estimate relevance
  • 152. QUERY EXPANSION USING TERM EMBEDDINGS Use embeddings to generate suitable query expansions estimate relevance Find good expansion terms based on nearness in the embedding space Better retrieval performance when combined with pseudo-relevance feedback (PRF) [Zamani and Croft, 2016] and if we learn query specific term embeddings [Diaz et al., 2016] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016. Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016. Hamed Zamani and W Bruce Croft. Embedding-based query language models. In ICTIR, 2016.
  • 153. QUERY EXPANSION Language-model-based IR: score(d, q) = KL(θ_q, θ_d), where θ_q is the query language model and θ_d is the document language model. Query expansion computes an expanded query model θ̂_q = U·Uᵀ·θ_q. For query expansion using PRF, U is the |v| × |D| term-document matrix; for query expansion using word embeddings, U is the |v| × k term embedding matrix.
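A numpy sketch of the expansion step: projecting the query language model through the term matrix U and back yields weights over all vocabulary terms, from which expansion terms can be picked (U, the vocabulary size, and the query below are placeholders).

```python
import numpy as np

V, k = 5000, 200
U = np.random.randn(V, k)          # |v| x k term embedding matrix (placeholder)
theta_q = np.zeros(V)              # query language model over the vocabulary
theta_q[[10, 42]] = 0.5            # e.g., a two-term query

theta_q_expanded = U @ (U.T @ theta_q)                  # expanded query model (unnormalized)
expansion_terms = np.argsort(-theta_q_expanded)[:20]    # top candidate expansion terms
```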
  • 154. “Word2vec is the sriracha sauce of deep learning!”
  • 155. BUT A GOOD CHEF… Would prepare the appropriate sauce for each dish.
  • 156. GLOBAL VS. LOCAL EMBEDDINGS Local Global tax cutting deficit squeeze vote reduce budget slash reduction reduction house spend bill lower plan halve spend soften billion freeze Nearest neighbors of the word “cut” (as in “gas cut”) in the embedding space Uglobal  embedding trained with P(w|C) Ulocal  embedding trained with P(w|R) Where, C is the whole document corpus R is the set of relevant documents only
  • 157. QUERY EXPANSION USING GLOBAL AND LOCAL WORD EMBEDDINGS Each point represents a candidate expansion term Red points have high frequency in the relevant set of documents White points have low or no frequency in the relevant set of documents The blue point represents the query. Contours indicate distance from the query global local Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In ACL, 2016.
  • 158. DEEP NEURAL METHODS FOR RANKING 90 MINS (14:50 PM - 16:20 PM)
  • 159. SEMANTIC HASHING Document autoencoder minimizing reconstruction error Input: word counts (vocab size = 2K) Output: binary vector Stacked RBMs w/ layer-by-layer pre- training followed by E2E tuning Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009.
  • 160. DEEP SEMANTIC SIMILARITY MODEL (DSSM) Siamese network trained E2E on query and document title pairs Relevance is estimated by cosine similarity between query and document embeddings Input: character trigraph counts (bag of words assumption) Minimizes cross-entropy loss against randomly sampled negative documents Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
  • 161. CONVOLUTIONAL DSSM (CDSSM) Replace bag-of-words assumption by concatenating term vectors in a sequence on the input Convolution followed by global max-pooling Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
  • 162. REMEMBER… …how different embedding spaces capture different notions of similarity?
  • 163. DSSM TRAINED ON DIFFERENT TYPES OF DATA (trained on pairs of… | sample training data | useful for? | paper)
Query and document titles | <“things to do in seattle”, “seattle tourist attractions”> | Document ranking | (Shen et al., 2014) https://dl.acm.org/citation...
Query prefix and suffix | <“things to do in”, “seattle”> | Query auto-completion | (Mitra and Craswell, 2015) https://dl.acm.org/citation...
Consecutive queries in user sessions | <“things to do in seattle”, “space needle”> | Next query suggestion | (Mitra, 2015) https://dl.acm.org/citation...
Each model captures a different notion of similarity (or regularity) in the learnt embedding space. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015. Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 164. Nearest neighbors for “seattle” and “taylor swift” based on two DSSM models – one trained on query-document pairs and the other trained on query prefix-suffix pairs DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra and Nick Craswell. Query auto-completion for rare prefixes. In CIKM, 2015.
  • 165. DIFFERENT REGULARITIES IN DIFFERENT EMBEDDING SPACES Groups of similar search intent transitions from a query log The DSSM trained on session query pairs can capture regularities in the query space (similar to word2vec for terms) Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 166. DSSM TRAINED ON SESSION QUERY PAIRS ALLOWS FOR ANALOGIES OVER SHORT TEXT! Bhaskar Mitra. Exploring session context using distributed representations of queries and reformulations. In SIGIR, 2015.
  • 167. INTERACTION-BASED NETWORKS Typically a document is relevant if some part of the document contains information relevant to the query Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing the ith window over query terms with the jth window over the document terms—captures evidence of relevance from different parts of the document Additional neural network layers can inspect the interaction matrix and aggregate the evidence to estimate overall relevance Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
  • 168. REMEMBER… …the importance of incorporating exact term matches as well as matches in the latent space for estimating relevance?
  • 169. LEXICAL AND SEMANTIC MATCHING NETWORKS Mitra et al. [2016] argue that both lexical and semantic matching is important for document ranking Duet model is a linear combination of two DNNs—focusing on lexical and semantic matching, respectively—jointly trained on labelled data Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 170. LEXICAL AND SEMANTIC MATCHING NETWORKS The lexical sub-model operates over a binary input matrix X, where x_{i,j} = 1 if t_{q,i} = t_{d,j} and 0 otherwise. In relevant documents: 1. Many matches, typically in clusters 2. Matches localized early in the document 3. Matches for all query terms 4. In-order (phrasal) matches. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
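A sketch of the lexical sub-model's input: a binary query-term × document-term match matrix that downstream convolutional layers can inspect for the patterns listed above (purely illustrative, not the Duet code).

```python
import numpy as np

def exact_match_matrix(query_terms, doc_terms):
    """X[i, j] = 1 if the i-th query term equals the j-th document term, else 0."""
    X = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, tq in enumerate(query_terms):
        for j, td in enumerate(doc_terms):
            if tq == td:
                X[i, j] = 1.0
    return X

X = exact_match_matrix("cambridge university".split(),
                       "the university of cambridge is a university in england".split())
```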
  • 171. LEXICAL AND SEMANTIC MATCHING NETWORKS Convolve using window of size 𝑛 𝑑 × 1 Each window instance compares a query term w/ whole document Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 172. LEXICAL AND SEMANTIC MATCHING NETWORKS Semantic sub-model matches in the latent embedding space Match query with moving windows over document Learn text embeddings specifically for the task Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 173. BIG VS. SMALL DATA REGIMES Big data seems to be more crucial for models that focus on good representation learning for text Partial supervision strategies (e.g., unsupervised pre-training of word embeddings) can be effective but may be leaving the bigger gains on the table Learning to train on unlabeled data may be key to making progress on neural ad-hoc retrieval Which IR models are similar? Clustering based on query level retrieval performance. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 174. Duet implementation on PyTorch https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb GET THE CODE
  • 175. WIDE AND DEEP MODEL Deep model for representation learning and wide model for memorization Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
  • 176. KERNEL POOLING Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
  • 177. WEB DOCUMENTS ARE MORE THAN JUST BODY TEXT… URL incoming anchor text title body clicked query
  • 178. EXTENDING NEURAL RANKING MODELS TO MULTIPLE DOCUMENT FIELDS BM25 → BM25F; neural ranking model → ?
  • 179. RANKING DOCUMENTS WITH MULTIPLE FIELDS Learn different embedding space for each document field Different fields may match different aspects of the query—learn different query embeddings for matching against different fields Represent per field match by a vector, not a score Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 180. Learn a different embedding space for each document field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 181. For multiple-instance fields, average pool the instance level embeddings Mask empty text instances, and average only among non-empty instances to avoid preferring documents with more instances Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 182. Learn different query embeddings for matching against different fields Different fields may match different aspects of the query Ideal query representation for matching against URL likely to be different from for matching with title Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 183. Represent per field match by a vector, not a score Allows the model to validate that across the different fields all aspects of the query intent have been covered (Similar intuition as BM25F) Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 184. Aggregate evidence of relevance across all document fields Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 185. High precision fields, such as clicked queries, can negatively impact the modeling of the other fields Field level dropout during training can regularize against over-dependency on any individual field Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
  • 186. MANY OTHER NEURAL ARCHITECTURES (Palangi et al., 2015) (Kalchbrenner et al., 2014) (Denil et al., 2014) (Kim, 2014) (Severyn and Moschitti, 2015) (Zhao et al., 2015) (Hu et al., 2014) (Tai et al., 2015) (Guo et al., 2016) (Hui et al., 2017) (Pang et al., 2017) (Jaech et al., 2017) (Dehghani et al., 2017)
  • 187. BERT FOR RANKING BERT (and other large-scale unsupervised language models) are demonstrating dramatic performance improvements on many IR tasks Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019. MS MARCO Query Passage Pair Query Passage score
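A sketch of BERT-style passage re-ranking using the Hugging Face transformers library; the checkpoint name is an assumption (any cross-encoder fine-tuned on MS MARCO-style relevance labels with a single relevance logit would do), and this is not the exact pipeline from the cited paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# assumed checkpoint: a cross-encoder fine-tuned for passage relevance
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

query = "what channel are the seahawks on today"
passages = ["The Seahawks game airs on ESPN at 8pm.",
            "Giraffes are even-toed ungulate mammals."]

# encode each (query, passage) pair jointly and score it with the classification head
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)   # assumes a single relevance logit
reranked = [p for _, p in sorted(zip(scores.tolist(), passages), reverse=True)]
```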
  • 188. Impact across both academia and industry BERT FOR RANKING
  • 189. WHAT DID YOUR MODEL REALLY LEARN? While we celebrate the recent performance bumps on IR tasks from neural methods, it is also important to recognize when and how they fail
  • 190. Clever Hans was a horse claimed to have been capable of performing arithmetic and other intellectual tasks. "If the eighth day of the month comes on a Tuesday, what is the date of the following Friday?“ Hans would answer by tapping his hoof. In fact, the horse was purported to have been responding directly to involuntary cues in the body language of the human trainer, who had the faculties to solve each problem. The trainer was entirely unaware that he was providing such cues. (source: Wikipedia)
  • 191. BM25 vs. BERT: BM25 depends on the inverse document frequency of terms, while BERT depends on a language model of term co-occurrences. What corpus statistics does your model depend on?
  • 192. WHAT CHANGED BETWEEN TRAIN AND TEST? Terms often change meaning across domains or over time. Robust retrieval performance is important (e.g., enterprise search across multiple tenants). Example—query: “uk prime minister”; the correct answer differs in older (1990s) TREC data, in recent data, and today.
  • 193. domain A domain B domain C domain X training domains test domain OPTIMIZING FOR CROSS DOMAIN PERFORMANCE
  • 194. Baseline model projects query and document to latent space for matching Additional fully-connected layers to estimate relevance Hidden layers may encode domain specific statistics convolution and pooling layers convolution and pooling layers hadamard product dense layers 𝑦 query doc How do we encourage the model to only learn features that generalize across multiple domains? OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 195. OPTIMIZING FOR CROSS DOMAIN PERFORMANCE Train model on multiple domains During training, an adversarial discriminator inspects the hidden states of the model and tries to predict the source corpus of the training sample convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc The duet model, in addition to optimizing for the ranking loss, also tries to “fool” the adversarial discriminator – and in the process learns more domain independent representations Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 196. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018.
  • 197. query relevant document non-relevant document parameters of the adversarial discriminator parameters of the ranking model Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
  • 198. Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. ADDITIONAL REGULARIZATION FOR THE RANKING LOSS
  • 199. Reverse the gradient from the discriminator when back-propagating through the ranking model convolution and pooling layers convolution and pooling layers hadamard product dense layers adversarial discriminator (dense) 𝑧 𝑦 query doc ≈ ≈ Daniel Cohen, Bhaskar Mitra, Katja Hofmann, and W. Bruce Croft. Cross domain regularization for neural ranking models using adversarial learning. In SIGIR, 2018. GRADIENT REVERSAL
  • 200. Adversarial regularization may also be useful for mitigating bias
  • 201. MARRYING THE OLD WITH THE NEW
  • 202. (SIGIR ’94) (SIGIR ’04) (SIGIR ’05)
  • 205. CONNECTION TO NEURAL RANKER TRAINING Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
  • 206. AXIOMATIC REGULARIZATION FOR NEURAL RANKER Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, and Saurabh Tiwary. An Axiomatic Approach to Regularizing Neural Ranking Models. In SIGIR, 2019.
  • 207. BREAK 10 MINS (16:20 PM - 16:30 PM)
  • 209. RETRIEVING, NOT JUST RERANKING, WITH DEEP NEURAL NETWORKS Deep ranking models are compute-intensive and are, in practice, employed only to rerank the top-k candidates retrieved by more efficient traditional IR methods. Their impact on IR performance could be significantly larger if we could also use them for candidate generation.
  • 210. OPTION 1: QUERY-INDEPENDENT DOCUMENT REPRESENTATION Employ a Siamese network architecture: compute document representations offline and the query representation at inference time. Efficient online, but large offline computation cost; effectiveness degrades without interaction features and lexical term matching.
  • 211. APPROXIMATE K-NN SEARCH Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. Off the beaten path: Let's replace term-based retrieval with k-nn search. In CIKM, 2016.
  • 212. LEARNING SPARSE VECTOR REPRESENTATIONS Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. In IJAR, 2009. Hamed Zamani, et al. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In CIKM, 2018.
  • 213. FAST APPROX. K-NN SEARCH WITH ANNOY https://github.com/spotify/annoy
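A minimal Annoy usage sketch (dimensions and data are placeholders): build an index over document vectors offline, then query it with a query vector at run time.

```python
import numpy as np
from annoy import AnnoyIndex

dim = 200
index = AnnoyIndex(dim, "angular")           # angular distance ~ cosine similarity

doc_vectors = np.random.randn(10000, dim)    # placeholder document embeddings
for doc_id, vec in enumerate(doc_vectors):
    index.add_item(doc_id, vec.tolist())
index.build(50)                              # 50 trees: more trees -> better recall, bigger index
index.save("docs.ann")                       # the saved index can be memory-mapped by query servers

query_vec = np.random.randn(dim)
top_doc_ids = index.get_nns_by_vector(query_vec.tolist(), 10)   # approximate top-10 neighbours
```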
  • 214. OPTION 2: ASSUME QUERY TERM INDEPENDENCE Efficient online but large offline computation cost; can scale to tail queries at higher computation cost—we can trade off the two experimentally. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 215. WE TYPICALLY LEARN THE PARAMETERS OF THE MODEL BY MINIMIZING SOME PAIRWISE LOSS… e.g., Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 216. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 217. TERM-DOCUMENT IMPACT SCORES d1 d2 d3 d4 d5 t1 t2 t3 t4 t5 If the IR model assumes query term independence, we can precompute all term-document impact scores The matrix is generally sparse, either by definition or by enforcing additional sparsity constraints (e.g., assume only terms that appear in the document have non-zero impact) Precomputed scores can be used with inverted index for fast retrieval from large collections Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
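A minimal sketch of how precomputed impact scores can back an inverted index (the `corpus`, `doc.terms`, and `model_term_score` names are placeholders): the offline pass stores sparse per-term impacts, and the online pass simply sums them over the query terms, which is exactly what the query term independence assumption permits.

```python
from collections import defaultdict

impact = defaultdict(list)                       # term -> [(doc_id, score), ...]

# Offline: score every (term, document) pair for terms that appear in the document.
for doc_id, doc in enumerate(corpus):
    for term in set(doc.terms):                  # sparsity: only terms present in the doc
        impact[term].append((doc_id, model_term_score(term, doc)))


def retrieve(query_terms, k=10):
    # Online: sum the precomputed per-term impacts over the query terms.
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, s in impact.get(term, []):
            scores[doc_id] += s
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```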
  • 218. NEURAL RANKING MODEL WITH QUERY TERM INDEPENDENCE ASSUMPTION score Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 219. Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 220. THE EFFECTIVENESS-EFFICIENCY TRADE-OFF The model does not have the context of the full query which may result in reduced effectiveness However, now we can precompute everything and use the learned model in a full retrieval setting! Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019.
  • 221. HOW SERIOUS IS THE LOSS IN EFFECTIVENESS FROM ASSUMING QUERY TERM INDEPENDENCE? Bhaskar Mitra et al. Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks. In arXiv, 2019. Reranking evaluation Full retrieval evaluation (on a smaller set of queries than previous table)
  • 222. DOCUMENT EXPANSION Similar to the query term independence approach in that both try to learn a better document language model. The training objective here, however, is to predict relevant queries rather than the target ranking metric we care about. Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document Expansion by Query Prediction. In arXiv, 2019.
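A hedged sketch of document expansion in this spirit, with a hypothetical `query_generator` standing in for the trained sequence-to-sequence model: predicted queries are appended to the document text before building a standard term-based index.

```python
def expand(document_text, n_queries=5):
    # query_generator is assumed to return n candidate queries for the document.
    predicted_queries = query_generator(document_text, n_queries)
    return document_text + " " + " ".join(predicted_queries)


expanded_corpus = [expand(doc) for doc in corpus]
# Index expanded_corpus with a traditional retriever (e.g., BM25) as usual.
```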
  • 223. TRADING-OFF SEARCH RESULT QUALITY AND QUERY RESPONSE TIME IN LARGE SCALE IR SYSTEMS In Bing, we have a candidate generation stage followed by multiple rank and prune stages Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 224. In Bing, the index is distributed over multiple machines For candidate generation, on each machine the documents are linearly scanned using a match plan Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 225. When a query comes in, it is automatically categorized and a pre-defined match plan is selected. A match plan consists of a sequence of match rules and their corresponding stopping criteria. A match rule defines the condition that a document must satisfy to be selected as a candidate. The stopping criterion decides when the index scan using a particular match rule should terminate, and whether the matching process should then continue with the next match rule, conclude, or reset to the beginning of the index. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 226. Match plans influence the trade-off between effectiveness and efficiency E.g., long queries with rare intents may require expensive match plans that consider body text and search deeper into the index In contrast, for popular navigational queries a shallow scan against URL and title metastreams may be sufficient Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 227. E.g., Query: halloween costumes Match rule: mrA → (halloween ∈ A|U|B|T ) ∧ (costumes ∈ A|U|B|T ) Query: facebook login Match rule: mrB → (facebook ∈ U|T ) Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 228. During execution, two accumulators are tracked: u, the number of blocks accessed from disk, and v, the cumulative number of term matches in all inspected documents. A stopping criterion sets thresholds for each; when either threshold is met, the scan using that particular match rule terminates. Matching may then continue with a new match rule, terminate, or restart from the beginning. Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
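A schematic sketch of match plan execution (not Bing's implementation; `match_plan` is assumed to be a list of (rule, u_threshold, v_threshold) entries and each rule a predicate returning the number of matching terms in a document):

```python
def run_match_plan(match_plan, index_blocks):
    candidates = []
    u = v = 0                                    # accumulators: blocks accessed, term matches
    for rule, u_max, v_max in match_plan:        # match rules in order
        for block in index_blocks:
            u += 1                               # one more block read from disk
            for doc in block:
                n_matches = rule(doc)            # e.g., query terms found in U/T/B/A streams
                v += n_matches
                if n_matches:
                    candidates.append(doc)
            if u >= u_max or v >= v_max:         # this rule's stopping criterion is met
                break                            # here: simply move on to the next match rule
    return candidates
```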
  • 229. TYPICALLY THESE MATCH PLANS ARE HAND- CRAFTED AND STATICALLY ASSIGNED TO DIFFERENT QUERY CATEGORIES WE CAN CAST THE MATCH PLANNING AS A REINFORCEMENT LEARNING TASK! Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 230. REINFORCEMENT LEARNING environment action reward agent state Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 231. (for Bing candidate generation) index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 232. (for Bing candidate generation) Learn a policy πθ : S → A which maximizes the cumulative discounted reward R Where, γ is the discount rate index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
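The expression itself is not reproduced in this text export; the standard form of the cumulative discounted reward described on the slide is:

```latex
R = \sum_{t=0}^{T} \gamma^{t} \, r_{t}, \qquad 0 \le \gamma \le 1
```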
  • 233. (for Bing candidate generation) We use table-based Q-learning. State space: discrete ⟨u_t, v_t⟩. Action space: the set of match rules. index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
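A generic sketch of table-based Q-learning over discretized accumulator states, with match rules as actions; the number of rules, bucket size, learning rate, discount, and epsilon below are illustrative values, not the ones used in the paper.

```python
import random
from collections import defaultdict

N_MATCH_RULES = 4                                # illustrative: one action per match rule
ACTIONS = list(range(N_MATCH_RULES))
Q = defaultdict(float)                           # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.1            # learning rate, discount, exploration


def discretize(u, v, bucket=1000):
    """Map the raw accumulators to a discrete state."""
    return (u // bucket, v // bucket)


def choose_action(state):
    if random.random() < epsilon:                # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```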
  • 234. (for Bing candidate generation) Reward function: g(di) is the relevance of the ith document estimated based on the subsequent L1 ranker score— considering only top n documents index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 235. (for Bing candidate generation) Final reward: If no new documents are selected, we assign a small negative reward index match rule relevance discounted by index blocks accessed agent accumulators (u, v) REINFORCEMENT LEARNING Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 236. RESULTS Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In SIGIR, 2018.
  • 237. DEEP LEARNING @ TREC 15 MINS (5:15 PM - 5:30 PM)
  • 238. GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA
Past: proprietary data. 200K queries, human-labeled, proprietary. Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017.
Past: weak supervision. 1+M queries, weak supervision, open. Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017.
Here: two new datasets. 300+K queries, human-labeled, open. TREC 2019 Deep Learning Track.
(Chart axes: more data vs. better search results.)
  • 239. GENERATING PUBLIC BENCHMARKS FOR NEURAL IR RESEARCH A public retrieval and ranking benchmark with large scale training data (~400K queries with manual relevance labels)
  • 240. DERIVING OUR TREC 2019 DATASETS MS MARCO QnA Leaderboard • 1M real queries • 10 passages per Q • Human annotation says ~1 of 10 answers the query MS MARCO Passage Retrieval Leaderboard • Corpus: Union of 10-passage sets • Labels: From the ~1 positive passage TREC 2019 Task: Passage Retrieval • Same corpus, training Q+labels • New reusable NIST test set TREC 2019 Task: Document Retrieval • Corpus: Documents (crawl passage urls) • Labels: Transfer from passage to doc • New reusable NIST test set http://msmarco.org https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 241. SETUP OF THE 2019 DEEP LEARNING TRACK • Key question: What works best in a large-data regime? • “nnlm”: Runs that use a BERT-style language model • “nn”: Runs that do representation learning • “trad”: Runs using only traditional IR features (such as BM25 and RM3) • Subtasks: • “fullrank”: End-to-end retrieval • “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
Task | Training data | Test data | Corpus
1) Document retrieval | 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents
2) Passage retrieval | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
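A hedged sketch of reading MS MARCO-style training triples; the file name and the tab-separated (query, positive passage, negative passage) layout are assumptions about the released data, so check the track site for the authoritative formats.

```python
def read_triples(path="triples.train.small.tsv"):   # file name is an assumption
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative
```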
  • 242. DATASET AVAILABILITY • Corpus + train + dev data for both tasks available now from the DL Track site* • NIST test sets available to participants now • [Broader availability in Feb 2020] * https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 243. SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
  • 244.
  • 245. Download the slides: http://bit.ly/dl4search-fire2019 Download the free book: http://bit.ly/neuralir-intro Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ @UnderdogGeek bmitra@microsoft.com THANK YOU

Editor's Notes

  1. Local representation | Distributed representation
One dimension for “banana” | “banana” is a pattern
Brittle under noise | More robust to noise
Precise | Near “mango”, “pineapple” (nuanced)
Add vocab → add dimensions | Add vocab → generate more vectors
K dimensions → K items | K dimensions → 2^K “regions”
  2. Clever Hans was a horse. It was claimed that he could do simple arithmetic. If you asked Hans a question, he would respond by tapping his hoof. After a thorough investigation it was determined, however, that what Clever Hans was really good at was reading very subtle and, in fact, unintentional cues that his trainer was giving him via his body language. Hans didn’t know arithmetic at all. But he was very good at spotting body language that CORRELATED highly with the right answer.
  3. A traditional IR model, such as BM25, makes very few assumptions about the target collection. You could argue that the inverse document frequencies (and a couple of BM25 hyper-parameters) are all that you learn from your collection. That is why you can throw BM25 at most retrieval tasks (e.g., TREC or web ranking in Bing) and it will give you reasonably good performance out-of-the-box in most cases. On the other hand, take a deep neural model, train it on the Bing web ranking task, and then evaluate it on TREC data, and I bet it falls flat on its face.
  4. This is an important problem. Think about an enterprise search solution that needs to cater to a large number of tenants. You train your model on only a few tenants, either because of privacy constraints or because most tenants are too small to provide enough training data. But afterwards you need to deploy the same model to all the tenants. Good cross-domain performance would be key in such a setting. How can we make these deep and large machine learning models, with all their lovely bells and whistles, as robust as a simple BM25 baseline?