WSDM 2017 Tutorial
Neural Text Embeddings for IR
Bhaskar Mitra and Nick Craswell
Download slides from: http://bit.ly/NeuIRTutorial-WSDM2017
Check out the full tutorial:
https://arxiv.org/abs/1705.01509
Neural network papers @ SIGIR
[Chart: share of SIGIR papers with the title words Neural, Embedding, Convolution, Recurrent, or LSTM: 1% (SIGIR 2014, accepted), 4% (SIGIR 2015, accepted), 8% (SIGIR 2016, accepted), 11% (SIGIR 2017, submitted), 20% (SIGIR 2018, optimistic?)]
Deep Learning
Amazingly successful on many hard
applied problems.
Dominating multiple fields:
(But also beware the hype.)
Christopher Manning.
Understanding Human Language:
Can NLP and Deep Learning Help?
Keynote SIGIR 2016 (slides 71,72)
[Timeline: deep learning dominating speech (2011), vision (2013), NLP (2015), IR (2017?)]
Neural methods for information retrieval
This tutorial mainly focuses on:
• Ranked retrieval of short and long texts, given a text query
• Shallow and deep neural networks
• Representation learning
For broader topics (multimedia, knowledge) see:
• Craswell, Croft, Guo, Mitra, and de Rijke. Neu-IR:
Workshop on Neural Information Retrieval. SIGIR 2016
workshop
• Hang Li and Zhengdong Lu. Deep Learning for
Information Retrieval. SIGIR 2016 tutorial
Today’s agenda
1. IR Fundamentals
2. Word embeddings
3. Word embeddings for IR
4. Deep neural nets
5. Deep neural nets for IR
6. Other applications
We cover key ideas, rather than exhaustively describing all NN IR ranking papers.
For a more complete overview see:
• Zhang et al. Neural information
retrieval: A literature review. 2016
• Onal et al. Getting started with
neural models for semantic
matching in web search. 2016
slides are available
here for download
Bhaskar Mitra
@UnderdogGeek
bmitra@microsoft.com
Nick Craswell
@nick_craswell
nickcr@microsoft.com
send us your questions and feedback
during or after the tutorial
Fundamentals of
Retrieval
Chapter 1
Information retrieval (IR) terminology
This tutorial: Using neural networks in the retrieval system to
improve relevance. For text queries and documents.
[Diagram: an information need is expressed as a query; the retrieval system, which indexes a document corpus, returns a results ranking (document list); relevance means the documents satisfy the information need, e.g., are useful]
IR applications

                    Document ranking                      Factoid question answering
Query               Keywords                              Natural language question
Document            Web page, news article                A fact and supporting passage
TREC experiments    TREC ad hoc                           TREC question answering
Evaluation metric   Average precision, NDCG               Mean reciprocal rank
Research solution   Modern TREC rankers: BM25, query      IBM@TREC-QA: answer type detection,
                    expansion, learning to rank, links,   passage retrieval, relation retrieval,
                    clicks+likes (if available)           answer processing and ranking
In products         Document rankers at Google, Bing,     Watson@Jeopardy, voice search
                    Baidu, Yandex, …
This NN tutorial    Long text ranking                     Short text ranking
Challenges in (neural) IR [slide 1/3]
• Vocabulary mismatch
Q: How many people live in Sydney?
✓ Sydney’s population is 4.9 million
[relevant, but missing ‘people’ and ‘live’]
✗ Hundreds of people queueing for live music in Sydney
[irrelevant, yet matching ‘people’ and ‘live’]
Vocabulary mismatch is worse for short texts, but still an issue for long texts.
• Need to interpret words based on context (e.g., temporal)
[Figure: results for the query “uk prime minister” today, recently, and in older (1990s) TREC data refer to different people]
Challenges in (neural) IR [slide 2/3]
• Q and D vary in length
• Models must handle
variable length input
• Relevant docs have
irrelevant sections
https://en.wikipedia.org/wiki/File:Size_distribution_among_Featured_Articles.png
Challenges in (neural) IR [slide 3/3]
• Need to learn Q-D relationship
that generalizes to the tail
• Unseen Q
• Unseen D
• Unseen information needs
• Unseen vocabulary
• Efficient retrieval over many
documents
• KD-Tree, LSH, inverted files, …
Figure from: Goel, Broder, Gabrilovich, and Pang. Anatomy of the long tail:
ordinary people with extraordinary tastes. WWW Conference 2010
Representation
of Words
Chapter 2
One-hot representation (local)
Dim = |V|; sim(banana, mango) = 0
banana: [0 0 0 0 0 1 0 0 0 0 0]
mango:  [0 0 0 0 0 0 0 0 0 1 0]
Notes: 1) A popular sim() is cosine. 2) Words/tokens come from some tokenization and transformation.
Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
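A minimal sketch (not from the deck) of one-hot vectors and cosine similarity; the vocabulary positions are made up:

```python
import numpy as np

V = 11                               # vocabulary size
vocab = {"banana": 5, "mango": 9}    # hypothetical positions in the vocabulary

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Distinct words are orthogonal under the local (one-hot) representation:
print(cosine(one_hot("banana"), one_hot("mango")))   # 0.0
```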
Context-based distributed representation
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 2010
“You shall know a word by the company it keeps.”
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford.
sim(banana, mango) > 0 if the two words appear in the same documents or near the same words.
Four choices of context for “banana”:
• Word-Document: Doc2, Doc7, Doc9
• Word-Word: (yellow), (grows), (on), (tree), (africa)
• Word-WordDist: (yellow, -1), (grows, +1), (on, +2), (tree, +3), (africa, +5)
• Word hash (not context-based): #ba, ban, ana, nan, ana, na#
Distributional Semantics: distributional methods use distributed representations.
Distributed and distributional
• Distributed representation:
Vector represents a concept as
a pattern, rather than 1-hot
• Distributional semantics:
Linguistic items with similar
distributions (e.g. context
words) have similar meanings
http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/
Vector Space Models
V: vocabulary, C: set of contexts, S: sparse |V| × |C| matrix with entries $s_{ij}$.
[Matrix diagram: rows $w_0 \dots w_{|V|}$, columns $c_0 \dots c_{|C|}$, cell (i, j) holds $s_{ij}$]
For a given task: choose the matrix, choose the $s_{ij}$ weighting. Could be binary, could be raw counts.
Example weightings:
• PPMI (Positive Pointwise Mutual Information) for a Word-Word matrix
• TF-IDF for a Word-Document matrix
Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 2010
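As a quick sketch of the PPMI weighting on a toy Word-Word counts matrix (the counts are invented):

```python
import numpy as np

# Toy word-word co-occurrence counts S (|V| x |C|).
S = np.array([[4., 1., 0.],
              [1., 3., 2.],
              [0., 2., 5.]])

p_wc = S / S.sum()                        # joint probability p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)     # marginal p(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))      # pointwise mutual information
ppmi = np.maximum(pmi, 0.0)               # clip negatives (and -inf for zero counts)
print(np.round(ppmi, 2))
```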
Distributed word representation overview
• Next we cover lower-dimensional dense representations, including word2vec; questions on their advantage over counts are studied in the footnoted papers*
• But first we consider the effect of context choice, in the data itself

Data            Counts matrix (sparse)   Learning from counts matrix   Learning from individual instances
Word-Document   Vector space models      LSA                           Paragraph2Vec PV-DBOW
Word-Word       Vector space models      GloVe                         Word2vec
Word-WordDist   Vector space models

* Baroni, Dinu and Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014
Levy, Goldberg and Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL 3:211-225
Let’s consider the following example…
We have four (tiny) documents,
Document 1 : “seattle seahawks jerseys”
Document 2 : “seattle seahawks highlights”
Document 3 : “denver broncos jerseys”
Document 4 : “denver broncos highlights”
Using Word-Document context
seattle and seahawks both appear in Documents 1 and 2; denver and broncos both appear in Documents 3 and 4.
→ sim(seattle, seahawks) and sim(denver, broncos) are high.
This is Topical or Syntagmatic similarity.
Using Word-WordDist context
seattle and denver share the contexts (jerseys, +2) and (highlights, +2); seahawks and broncos share (jerseys, +1) and (highlights, +1), with (seattle, -1) and (denver, -1) playing the same role.
→ sim(seattle, denver) and sim(seahawks, broncos) are high.
This is Typical or Paradigmatic similarity.
Let’s consider another example, with more (tiny) documents…
“seattle map”
“seattle weather”
“seahawks jerseys”
“seahawks highlights”
“seattle seahawks wilson”
“seattle seahawks sherman”
“seattle seahawks browner”
“seattle seahawks lfedi”
“denver map”
“denver weather”
“broncos jerseys”
“broncos highlights”
“denver broncos lynch”
“denver broncos sanchez”
“denver broncos miller”
“denver broncos marshall”
Using Word-Word context
seattle and denver now share contexts such as map, weather, jerseys, and highlights; seahawks and broncos likewise share contexts, including the player names (wilson, sherman, browner, lfedi; lynch, sanchez, miller, marshall).
→ sim(seattle, denver) and sim(seahawks, broncos) are high.
1. Word-Word context is less sparse than Word-Document (Yan et al., 2013)
2. Gives a mix of topical and typical similarity (a function of window size)
Paradigmatic vs Syntagmatic
Do we choose one axis or the other, or both, or something else? Can we learn a general-purpose word representation that solves all our IR problems? Or do we need several?
seattle seahawks richard sherman jersey
denver  broncos  trevor  siemian jersey
(reading across a query: the syntagmatic, or topical, axis; aligning the two queries word by word: the paradigmatic, or typical, axis)
Embeddings (distributed)
Dense. Dim = 200 (for example).
[Figure: banana and mango embed close together; dog is farther away]
Latent Semantic Analysis
V: vocabulary, D: set of documents, X: sparse |V| × |D| matrix with $x_{ij}$ = TF-IDF($t_i$, $d_j$).
[Matrix diagram: rows $t_0 \dots t_{|V|}$, columns $d_0 \dots d_{|D|}$, cell (i, j) holds $x_{ij}$]
Singular value decomposition of X:
$X = U \Sigma V^\top$
Deerwester, Dumais, Furnas, Landauer and Harshman. Indexing by latent semantic analysis. JASIS 41, no. 6, 1990
Latent Semantic Analysis
$\sigma_1, \dots, \sigma_l$: singular values; $u_1, \dots, u_l$: left singular vectors; $v_1, \dots, v_l$: right singular vectors.
The k largest singular values, with the corresponding singular vectors from U and V, give the rank-k approximation of X:
$\hat{X}_k = U_k \Sigma_k V_k^\top$
Embedding of the i-th term = $\Sigma_k t_i$, where $t_i$ is the i-th row of $U_k$.
Source: en.wikipedia.org/wiki/Latent_semantic_analysis
Dumais. Latent semantic indexing (LSI): TREC-3 report. NIST Special Publication SP (1995): 219-219.
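A small sketch of LSA with numpy (toy TF-IDF values, k = 2):

```python
import numpy as np

# Toy term-document TF-IDF matrix X (|V| x |D|); terms 0-1 and 2-3 co-occur.
X = np.array([[0.9, 0.8, 0.0, 0.0],
              [0.7, 0.9, 0.1, 0.0],
              [0.0, 0.1, 0.8, 0.9],
              [0.0, 0.0, 0.9, 0.7]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_emb = U[:, :k] * s[:k]    # i-th row: rank-k LSA embedding of the i-th term

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(term_emb[0], term_emb[1]))   # high: terms 0 and 1 share documents
print(cos(term_emb[0], term_emb[2]))   # low: terms 0 and 2 do not
```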
Word2vec
Goal: a simple (shallow) neural model learning from a billion-word scale corpus.
Predict words from their neighbors within a fixed-size context window.
Two different architectures:
1. Skip-gram
2. CBOW
(Mikolov et al., 2013)
Skip-gram
Predict neighbor $w_{t+j}$ given word $w_t$. Maximizes the following average log probability:

$\mathcal{L}_{skip\text{-}gram} = \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$

$p(w_{t+j} \mid w_t) = \frac{\exp\left(W_{OUT}(w_{t+j})^\top W_{IN}(w_t)\right)}{\sum_{v=1}^{|V|} \exp\left(W_{OUT}(w_v)^\top W_{IN}(w_t)\right)}$

[Diagram: $w_t$ → $W_{IN}$ → hidden state → $W_{OUT}$ → $w_{t+j}$]

The full softmax is computationally impractical. Word2vec uses hierarchical softmax or negative sampling instead.
(Mikolov et al., 2013)
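A minimal sketch of skip-gram training using gensim (assumed installed); a real model needs a corpus many orders of magnitude larger:

```python
from gensim.models import Word2Vec

corpus = [["seattle", "seahawks", "jerseys"],
          ["seattle", "seahawks", "highlights"],
          ["denver", "broncos", "jerseys"],
          ["denver", "broncos", "highlights"]]

# sg=1 selects skip-gram; negative=5 selects negative sampling.
model = Word2Vec(corpus, vector_size=10, window=2, sg=1,
                 negative=5, min_count=1, epochs=100, seed=1)
print(model.wv.most_similar("seattle", topn=2))   # toy corpus, so take with a grain of salt
```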
Continuous Bag-of-Words
Predict a word given the bag of its neighbors. Modify the skip-gram loss function:

$\mathcal{L}_{CBOW} = \frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid w_{t*})$, where $w_{t*} = \sum_{-c \le j \le c,\, j \ne 0} w_{t+j}$

[Diagram: $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ → $W_{IN}$ → $w_{t*}$ → $W_{OUT}$ → $w_t$]
(Mikolov et al., 2013)
Word analogies
with word2vec
[king] – [man] + [woman] ≈ [queen]
(Mikolov et al., 2013)
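The analogy is just vector arithmetic plus a nearest-neighbor lookup; a sketch using gensim's downloader and the public Google News vectors (a ~1.6 GB download):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pre-trained word2vec vectors
# [king] - [man] + [woman] ~= [queen]
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```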
Word analogies can work in underlying data too
In the Word-Word context example above, [seahawks] – [seattle] + [denver] lands near [broncos]: the contexts seahawks shares with seattle (e.g., wilson, sherman, browner, lfedi) are swapped for those denver shares with broncos (e.g., lynch, sanchez, miller, marshall), while shared contexts like jerseys and highlights remain.
Sparse vectors can work well for an analogy task:
Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
A Matrix Interpretation of word2vec
Skip-gram looks like this:

$\mathcal{L}_{skip\text{-}gram} = -\sum_{A}\sum_{B} \log p(B \mid A)$

If we aggregate over all training samples, with $X_{i,j}$ the word-word co-occurrence count and $X_i = \sum_j X_{i,j}$:

$\mathcal{L}_{skip\text{-}gram} = -\sum_{i=1}^{|V|}\sum_{j=1}^{|V|} X_{i,j} \log p(w_i \mid w_j) = -\sum_{i=1}^{|V|} X_i \sum_{j=1}^{|V|} \frac{X_{i,j}}{X_i} \log p(w_i \mid w_j)$

$= -\sum_{i=1}^{|V|} X_i \sum_{j=1}^{|V|} P(w_i \mid w_j) \log p(w_i \mid w_j) = \sum_{i=1}^{|V|} X_i\, H\big(P(w_i \mid w_j),\, p(w_i \mid w_j)\big)$

where $P(w_i \mid w_j)$ is the actual co-occurrence probability, $p(w_i \mid w_j)$ is the co-occurrence probability predicted by the model, and $H$ is cross-entropy.
(Pennington et al., 2014)
GloVe
Uses a variety of window sizes and weightings; trained with AdaGrad.

$\mathcal{L}_{GloVe} = \sum_{i=1}^{|V|}\sum_{j=1}^{|V|} f(X_{i,j}) \big(\log X_{i,j} - w_i^\top \tilde{w}_j\big)^2$

A weighted squared error between the actual co-occurrence statistic ($\log X_{i,j}$) and the model's prediction ($w_i^\top \tilde{w}_j$), where $f$ is the weighting function and $X$ is the word-word co-occurrence matrix.
(Pennington et al., 2014)
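A toy sketch of the GloVe objective with plain SGD (the paper's bias terms and AdaGrad are omitted to match the slide's simplified loss; the co-occurrence counts are invented):

```python
import numpy as np

X = np.array([[10., 4., 0.5],      # toy word-word co-occurrence counts
              [4., 8., 1.0],
              [0.5, 1., 6.0]])
rng = np.random.default_rng(0)
V, dim, lr = X.shape[0], 5, 0.05
W = rng.normal(scale=0.1, size=(V, dim))     # target word vectors
Wc = rng.normal(scale=0.1, size=(V, dim))    # context word vectors

def f(x, x_max=100.0, alpha=0.75):           # GloVe weighting function
    return min((x / x_max) ** alpha, 1.0)

for _ in range(500):
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                err = W[i] @ Wc[j] - np.log(X[i, j])
                g = 2.0 * f(X[i, j]) * err
                W[i], Wc[j] = W[i] - lr * g * Wc[j], Wc[j] - lr * g * W[i]

loss = sum(f(X[i, j]) * (W[i] @ Wc[j] - np.log(X[i, j])) ** 2
           for i in range(V) for j in range(V) if X[i, j] > 0)
print(round(loss, 4))   # the weighted squared error should be small after training
```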
Discussion of word representations
• Representations: Weighted counts, SVD, neural embedding
• Use similar data (W-D, W-W), capture similar semantics
• For example, analogies using counts
• Think sparse, act dense
• Stochastic gradient descent allows us to scale to larger data
• On individual instances (w2v) or count matrix data (GloVe)
• Scalability could be the key advantage for neural word embeddings
• Choice of data affects semantics
• Paradigmatic vs Syntagmatic
• Which is best for your application?
Word Embeddings for IR
Chapter 3
Traditional IR feature design
• Inverse document frequency: RSJ naïve Bayes model of IDF, built from the contingency counts r, R, n, N (Robertson and Sparck-Jones, 1977)
• Term frequency: 2-Poisson model of TF (Harter, 1975)
[Plot: probability mass of the term frequency of t in D under the 2-Poisson model: one component for “D is about t”, one for “D not about t”]
• Adjust the TF model for document length
Rank D by: $\sum_{t \in Q} TF(t, D) \cdot IDF(t)$
Robertson and Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, no. 4 (2009)
How to incorporate embeddings
A. Extend traditional IR models
• Term weighting, language model smoothing, translation of vocab
B. Expand query using embeddings (followed by non-neural IR)
• Add words similar to the query
C. IR models that work in the embedding space
• Centroid distance, word mover’s distance
Term weighting using word embeddings
$y = \frac{r}{R}$ (term recall)
• Fraction of relevant (positive) docs containing t
• The $r$ and $R$ were missing in RSJ
$\mathbf{x} = \mathbf{t} - \mathbf{q}$
• $\mathbf{t}$ is the embedding of t
• $\mathbf{q}$ is the centroid of all query terms
• Learn to predict $y$ from $\mathbf{x}$; weight TF-IDF using $y$
Zheng and Callan. Learning to reweight terms with distributed representations. SIGIR 2015
Traditional IR model: Query likelihood
• The language modeling approach to IR is quite extensible
$p(d \mid q) \propto p(q \mid d)\, p(d)$
• $p(d)$ can be assumed uniform across docs
• $p(q \mid d) = \prod_{w \in q} p(w \mid d)$ depends on modeling the document
• Frequent words in $d$ are more likely (term frequency)
• Smoothing according to the corpus (plays the role of IDF)
• Various ways of dealing with document length normalization
Ponte and Croft. A language modeling approach to information retrieval. SIGIR 1998
Zhai and Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. SIGIR 2001
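A minimal sketch of query likelihood with Dirichlet smoothing (Zhai and Lafferty); the corpus and μ are toy values:

```python
import math
from collections import Counter

docs = {"d1": "seattle seahawks jerseys".split(),
        "d2": "denver broncos jerseys".split()}
collection = [w for d in docs.values() for w in d]
cf = Counter(collection)

def log_p_q_given_d(query, doc, mu=10.0):
    tf, dlen = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        p_c = cf[w] / len(collection)             # collection LM (IDF-like role)
        p = (tf[w] + mu * p_c) / (dlen + mu)      # Dirichlet-smoothed p(w|d)
        score += math.log(p)
    return score

q = "seattle jerseys".split()
print(sorted(docs, key=lambda d: -log_p_q_given_d(q, docs[d])))   # ['d1', 'd2']
```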
Generalized
Language Model
Ganguly, Roy, Mitra, and Jones. Word embedding based generalized language model for information retrieval. SIGIR 2015
Compares query term with
every document term
Neural Language
Translation Model
Zuccon, Koopman, Bruza, and Azzopardi. "Integrating and evaluating neural word embeddings in
information retrieval." Australasian Document Computing Symposium 2015
Translation probability from document
term u to query term w
Considers all query-document
term pairs
Based on: Berger and Lafferty. "Information retrieval as statistical translation." SIGIR 1999
B. Query expansion
Both passages have same number
of gold query matches.
Yet non-query green matches can be good evidence of aboutness.
We need methods to consider
non-query terms. Traditionally:
automatic query expansion.
Example from Mitra et al., 2016
Query: Albuquerque
Query expansion using w2v
Identify expansion terms using w2v cosine similarity
3 different strategies: pre-retrieval, post-retrieval, and pre-retrieval
incremental
Beats no expansion, but does not beat non-neural expansion
Roy, Paul, Mitra and Garain. Using Word embeddings for automatic query expansion. Neu-IR Workshop 2016
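A sketch of the pre-retrieval variant under invented embeddings: score candidate terms by cosine similarity to the centroid of the query term vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["seattle", "seahawks", "denver", "broncos", "jerseys", "weather"]
emb = {}
for w in vocab:                                    # hypothetical unit vectors
    v = rng.normal(size=50)
    emb[w] = v / np.linalg.norm(v)

def expand(query_terms, k=2):
    centroid = np.mean([emb[t] for t in query_terms], axis=0)
    centroid /= np.linalg.norm(centroid)
    candidates = [w for w in emb if w not in query_terms]
    return sorted(candidates, key=lambda w: -(emb[w] @ centroid))[:k]

print(expand(["seattle", "seahawks"]))             # top-k expansion candidates
```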
Query expansion using w2v
• Related paper. Distance 𝛿 has a sigmoid transform. First model:
• Their second model uses the first round of retrieval
• Can beat non-neural expansion
Zamani and Croft. Embedding-based query language models. International Conference on the Theory of Information Retrieval 2016
Lavrenko and Croft. Relevance based language models. SIGIR 2001
Optimizing the query vector
Zamani and Croft. Estimating embedding vectors for queries. International Conference on the Theory of Information Retrieval 2016
Global vs. local
embedding spaces
Train w2v on documents
from first round of retrieval
Fine-grained word sense
disambiguation
A large number of
embedding spaces can be
cached in practice
(Diaz et al., 2016)
C. IR models in the embedding space
• Q: Bag of word vectors
• D: Bag of word vectors
• Considerations:
• Deal with variable length of Q and D
• Match all pairs? Do some alignment of Q and D?
Comparing short texts
• Weighted semantic network
• Related to word alignment
• Each word in the longer text is connected to its most similar word in the shorter text
• BM25-like edge weighting
• Generates features for
supervised learning of short
text similarity
Kenter and de Rijke. Short text similarity with word embeddings. CIKM 2015
Word Mover’s Distance
Kusner, Sun, Kolkin and Weinberger. From Word Embeddings To Document Distances. ICML 2015
See also: Huang, Guo, Kusner, Sun, Weinberger and Sha. Supervised Word Mover’s Distance. NIPS 2016
Sparse normalized TF (nBOW) vectors: $\|d\|_1 = \|d'\|_1 = 1$
Sparse flow matrix T: $\sum_j T_{ij} = d_i$ and $\sum_i T_{ij} = d'_j$
Minimize the cost $\sum_{i,j} T_{ij}\, c(i,j)$, where $c(i,j)$ is the distance between the word2vec embeddings of words i and j
Bounds on word mover’s distance
WCD <= RWMD <= WMD
WCD: Word centroid distance
RWMD: Relaxed WMD
Prefetch and prune algorithm:
• Sort by WCD
• Compute WMD on top-k
• Continue, using RWMD to
prune 95% of WMD
Kusner, Sun, Kolkin and Weinberger. From Word Embeddings To Document Distances. ICML 2015
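A sketch of the cheap lower bound, word centroid distance, under invented embeddings (gensim's KeyedVectors.wmdistance can compute the full WMD, if available):

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["obama", "president", "speaks", "media", "press",
         "greets", "illinois", "chicago"]
emb = {w: rng.normal(size=20) for w in words}     # hypothetical word vectors

def wcd(doc1, doc2):
    c1 = np.mean([emb[w] for w in doc1], axis=0)  # nBOW centroid of doc 1
    c2 = np.mean([emb[w] for w in doc2], axis=0)  # nBOW centroid of doc 2
    return np.linalg.norm(c1 - c2)                # Euclidean distance of centroids

print(wcd(["obama", "speaks", "media", "illinois"],
          ["president", "greets", "press", "chicago"]))
```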
Centroid and related techniques
• Goal: bag of vectors → single fixed-size vector
• Centroid, weighted centroid (or sum [Vulic and Moens, 2015])
• Aggregate using a Fisher Kernel
• Clinchant and Perronnin. Aggregating continuous word embeddings for information retrieval.
(ACL) Workshop on Continuous Vector Space Models and their Compositionality, 2013
• Using Fisher Kernel from Jaakkola and Haussler (1999)
What if I told you everyone using w2v is
throwing half the model away?
Two sets of embeddings are trained ($W_{IN}$ and $W_{OUT}$), but $W_{OUT}$ is generally discarded.
The IN-OUT dot product captures the log probability of co-occurrence.
(Mitra et al., 2016)
Notions of
Similarity
IN-OUT captures more
Topical notion of similarity
than IN-IN and OUT-OUT
Effect is exaggerated when
embeddings are trained on
short text (e.g., queries)
(Mitra et al., 2016)
Dual Embedding
Space Model
Compare query terms with every document term
Equivalent to taking centroid of all term vectors
Combine with traditional retrieval model
(Mitra et al., 2016)
Query and document representation:
Centroid of all words, including repetition
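A sketch of the IN-OUT score under invented embedding tables (w_in / w_out stand in for the trained WIN and WOUT):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["cambridge", "oxford", "university", "giraffe", "tall", "mammal"]
w_in = {w: rng.normal(size=30) for w in vocab}    # hypothetical IN embeddings
w_out = {w: rng.normal(size=30) for w in vocab}   # hypothetical OUT embeddings

def unit(v):
    return v / np.linalg.norm(v)

def desm(query, doc):
    # Document: centroid of normalized OUT vectors (repetitions included).
    d_bar = unit(np.mean([unit(w_out[t]) for t in doc], axis=0))
    # Score: average cosine between each query term's IN vector and d_bar.
    return float(np.mean([unit(w_in[q]) @ d_bar for q in query]))

print(desm(["cambridge"], ["oxford", "university", "giraffe"]))
```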
Telescoping
• DESM also works well when “telescoping” i.e. reranking a small set
• For example, DESM can confuse Oxford and Cambridge
• Bing rarely makes the Oxford-Cambridge mistake
• The upstream ranker (Bing) saves DESM from making that mistake
• If DESM sees more documents, it may make the mistake
Matveeva, Burges, Burkard, Laucius, and Wong. High accuracy retrieval with multiple nested ranker. SIGIR 2006
Because Cambridge is not
"an African even-toed ungulate mammal"
https://github.com/bmitra-msft/Demos/blob/master/notebooks/DESM.ipynb
Paragraph2vec
PV-DM
A mix of LSA and w2v style training data; less sparse than training just on paragraph ID
PV-DBOW
Effectively modelling the same
relationship as LSA (factorizing
word-document matrix)
Le and Mikolov. Distributed Representations of Sentences and Documents. ICML 2014
Models of
similar flavour
Grbovic, Djuric, Radosavljevic, Silvestri and Bhamidipati.
Context-and content-aware embeddings for query rewriting in sponsored search.
SIGIR 2015.
Sun, Guo, Lan, Xu and Cheng
Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations
ACL 2015
Adapting PV-DBOW for IR
Negative sampling using IDF instead
of corpus frequency
L2 regularization constraint on the
norm to avoid over-fitting on short
documents
Reduce sparseness by predicting
context words (similar to PV-DM?)
Ai, Yang, Guo and Croft. Analysis of the Paragraph Vector Model for Information Retrieval. ICTIR 2016
Discussion of embeddings in IR
• The right mix of lexical and semantic matching is important
• In non-telescoping settings, need to combine with lexical IR
• Many IR models do some sort of vector comparison of Q-D words
• Open question: Which notion of similarity is best for each IR
model
• Typical vs Topical vs some mix
• These methods seem promising if:
• High-quality, domain-appropriate embeddings are available
• No large-scale supervised IR data available
• If large-scale supervised IR data is available… (after the break)
Deep Neural
Networks
Chapter 4
A primer on neural networks
Chains of parameterized linear transforms (e.g., multiply by a weight, add a bias) followed by non-linear functions (σ)
Popular choices for σ: tanh, ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
[Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes the prediction, the loss compares it against the expected output, and the backward pass propagates gradients]
Visual motivation for hidden units
Consider the following “toy” challenge for classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
Or more succinctly…

Input features (surface, kerberos, book, library) → Label
1 0 1 0 → ✓
1 1 0 0 → ✗
0 1 0 1 → ✓
0 0 1 1 → ✗

Can’t separate using a linear model!
Visual motivation for hidden units
But let’s consider a tiny neural network with one hidden layer…
[Diagram: inputs surface, kerberos, book, library feeding hidden units H1 and H2 with weights ±0.5 and ±1]

Input features (surface, kerberos, book, library) → Hidden layer (H1, H2) → Label
1 0 1 0 → 1 0 → ✓
1 1 0 0 → 0 0 → ✗
0 1 0 1 → 0 1 → ✓
0 0 1 1 → 0 0 → ✗

Can separate using a linear model!
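A quick check in numpy, using one weight assignment consistent with the diagram's ±0.5 / ±1 labels (the exact wiring is an assumption, since the figure itself is not recoverable here):

```python
import numpy as np

X = np.array([[1, 0, 1, 0],    # surface book      -> positive
              [1, 1, 0, 0],    # kerberos surface  -> negative
              [0, 1, 0, 1],    # kerberos library  -> positive
              [0, 0, 1, 1]])   # library book      -> negative

W = np.array([[ 0.5, -1.0],    # surface  -> (H1, H2); assumed wiring
              [-1.0,  0.5],    # kerberos
              [ 0.5, -1.0],    # book
              [-1.0,  0.5]])   # library

H = (X @ W >= 1.0).astype(int)        # step non-linearity with threshold 1
print(H.tolist())                     # [[1, 0], [0, 0], [0, 1], [0, 0]] as in the table
print((H.sum(axis=1) > 0).tolist())   # a linear readout of H now separates the labels
```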
Why adding depth helps
Deeper networks can split the input space into many more (non-independent) linear regions than shallow networks can
Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks NIPS 2014
Why adding depth helps
http://playground.tensorflow.org
Neural models for text
But how do you feed text to a neural network?
Dogs have owners, cats have staff.
Local representations of input text
Char-level models
d o g s   h a v e   o w n e r s   c a t s   h a v e   s t a f f
One-hot vectors per character, concatenated → [chars × channels]
Local representations of input text
Word-level models w/ bag-of-chars per word
d o g s   h a v e   o w n e r s   c a t s   h a v e   s t a f f
One-hot character vectors summed within each word, then concatenated → [words × channels]
Local representations of input text
Word-level models w/ bag-of-trigraphs per word
# d o g s #   # h a v e #   # o w n e r s #   # c a t s #   # h a v e #   # s t a f f #
One-hot trigraph vectors summed within each word, then concatenated or summed → [words × channels] or [1 × channels]
Local representations of input text
Word-level models w/ pre-trained embeddings
d o g s   h a v e   o w n e r s   c a t s   h a v e   s t a f f
Pre-trained embeddings per word, concatenated or summed → [words × channels] or [1 × channels]
Shift-invariant
neural operations
Detecting a pattern in one part of input space is same as
detecting it in another
(also applies to sequential inputs, and inputs with dims >2)
Leverage redundancy by operating on a moving window
over whole input space and then aggregate
The repeatable operation is called a kernel, filter, or cell
Different aggregation strategies lead to different architectures
Convolution
Move a window over the input space, each time applying the same cell over the window.
A typical cell operation: $h = \sigma(WX + b)$
Full Input: [words × in_channels]
Cell Input: [window × in_channels]
Cell Output: [1 × out_channels]
Full Output: [1 + (words − window) / stride × out_channels]
Pooling
Move a window over the input space, each time applying an aggregate function over each dimension within the window (max-pooling or average-pooling):
$h_j = \max_{i \in win} X_{i,j}$ or $h_j = avg_{i \in win} X_{i,j}$
Full Input: [words × channels]
Cell Input: [window × channels]
Cell Output: [1 × channels]
Full Output: [1 + (words − window) / stride × channels]
Convolution w/ Global Pooling
Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed-length embedding for variable-length text.
Full Input: [words × in_channels]
Full Output: [1 × out_channels]
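A numpy sketch of a 1D convolution (stride 1) followed by global max-pooling; the weights and input are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
words, in_ch, out_ch, window = 7, 8, 4, 3
X = rng.normal(size=(words, in_ch))              # [words x in_channels]
W = rng.normal(size=(window * in_ch, out_ch))    # shared convolution cell
b = np.zeros(out_ch)

# Apply the same cell (ReLU of an affine map) at every window position.
conv = np.stack([np.maximum(X[i:i + window].ravel() @ W + b, 0.0)
                 for i in range(words - window + 1)])
embedding = conv.max(axis=0)                     # global max-pool over positions
print(conv.shape, embedding.shape)               # (5, 4) (4,)
```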
Recurrent Neural Network
Similar to a convolution layer, but with an additional dependency on the previous hidden state.
A simple cell operation is shown below, but others like LSTMs and GRUs are more popular in practice:
$h_i = \sigma(W X_i + U h_{i-1} + b)$
Full Input: [words × in_channels]
Cell Input: [window × in_channels] + [1 × out_channels]
Cell Output: [1 × out_channels]
Full Output: [1 × out_channels]
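The same fixed-length-output property, sketched for the simple recurrent cell above (random stand-in weights):

```python
import numpy as np

rng = np.random.default_rng(4)
words, in_ch, out_ch = 6, 8, 5
X = rng.normal(size=(words, in_ch))      # input sequence
W = rng.normal(size=(in_ch, out_ch))
U = rng.normal(size=(out_ch, out_ch))
b = np.zeros(out_ch)

h = np.zeros(out_ch)
for x in X:                              # h_i = sigmoid(W x_i + U h_{i-1} + b)
    h = 1.0 / (1.0 + np.exp(-(x @ W + h @ U + b)))
print(h.shape)                           # (5,) regardless of sequence length
```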
Recursive NN
or Tree-RNN
Weights are shared among all levels of the tree. The cell can be an LSTM or as simple as $h = \sigma(WX + b)$
Full Input: [words × channels]
Cell Input: [window × channels]
Cell Output: [1 × channels]
Full Output: [1 × channels]
Autoencoders
Unsupervised models trained to minimize
reconstruction errors
Information Bottleneck method (Tishby et al., 1999)
The bottleneck layer captures “minimal sufficient
statistics” of X and is a compressed representation of
the input
Image source: https://en.wikipedia.org/wiki/Autoencoder
Computation
Networks
The “Lego” approach to specifying DNN architectures
Library of computation nodes, each node defines logic for:
1. Forward pass: compute output given input
2. Backward pass: compute gradient of loss w.r.t. inputs, given
gradient of loss w.r.t. outputs
3. Parameter gradient: compute gradient of loss w.r.t.
parameters, given gradient of loss w.r.t. outputs
Chain nodes to create bigger and more complex networks
Really Deep
Neural Networks
(Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
DNNs for IR
Chapter 5
Taxonomy of Models
Neural networks can be used to:
1. Generate query representation,
2. Generate document representation, and/or
3. Estimate relevance.
This can involve pre-trained representations or
end-to-end training or some combination of
approaches.
[Diagram: query text → generate query representation → query vector; doc text → generate doc representation → doc vector; the two vectors meet at estimate relevance. The three stages are the point of query representation, the point of doc representation, and the point of match.]
You’ve already seen…
• Query expansion using embeddings, followed by query likelihood over query and doc term vectors, e.g., Roy et al., 2016 and Diaz et al., 2016
• Generating query and doc embeddings and estimating relevance by cosine similarity, e.g., DESM (Mitra et al., 2016)
What does ad-hoc IR data look like?
Four sources: search queries, the document corpus, user interaction / click data, and human annotated labels.
Traditional IR uses human labels as ground truth for evaluation, so ideally we want to train our ranking models on human labels.
User interaction data is rich, but may contain different biases compared to human annotated labels.
What does ad-hoc IR data look like?
In industry:
• Document corpus: billions?
• Query corpus: many billions?
• Labelled data w/ raw text: hundreds of thousands of queries
• Labelled data w/ learning-to-rank style features: same as above
• User interaction data: billions?
What does ad-hoc IR data look like?
In academia:
• Document corpus: few billion
• Query corpus: few million, but other sources (e.g., wiki titles) can be used
• Labelled data w/ raw text: few hundred to few thousand queries
• Labelled data w/ learning-to-rank style features: tens of thousands of queries
A pessimistic summary…
“Water, water, everywhere, nor any drop to drink.”
(Samuel Taylor Coleridge, The Rime of the Ancient Mariner)
Levels of Supervision (or the optimistic view)
Unsupervised
• Train embeddings on
unlabeled corpus and
use in traditional IR
models
• E.g., GLM, NTLM, DESM,
Semantic hashing
Semi-supervised
• DNN models using pre-
trained embeddings for
input text representation
• E.g., DRMM,
MatchPyramid
Fully supervised
• DNNs w/ raw text input (one-hot word vectors or n-graph vectors) trained on labels or clicks
• E.g., Duet
*Of course, ad-hoc retrieval isn’t everything. A lot of recent DNN-for-IR literature focuses on other tasks, such as short-text similarity or QnA, where enough training data is available.
We covered
most of these
Semantic
hashing
Autoencoder minimizing document
reconstruction error
Vocab: 2K popular words (stopwords removed)
In/out = word counts and binary hidden units
Stacked RBMs w/ layer-by-layer pre-training
followed by E2E tuning
Evaluation based on class labels; no relevance judgments
(Salakhutdinov and Hinton, 2007)
Deep Semantic
Similarity Model
Siamese network trained E2E on query and
document title pairs
Relevance is estimated by cosine similarity
between query and document embeddings
Input: character trigraph counts (bag of words
assumption)
Minimizes cross-entropy loss against randomly
sampled negative documents
(Huang et al., 2013)
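A sketch of the character-trigraph input representation (word hashing), which maps an unbounded word vocabulary onto a small trigraph vocabulary:

```python
from collections import Counter

def trigraphs(text):
    bag = Counter()
    for word in text.lower().split():
        padded = "#" + word + "#"        # mark word boundaries
        bag.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return bag

print(trigraphs("dogs have owners"))
# e.g. Counter({'#do': 1, 'dog': 1, 'ogs': 1, 'gs#': 1, '#ha': 1, ...})
```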
Convolutional
DSSM
Replace bag-of-words assumption by concatenating term
vectors in a sequence on the input
(Shen et al., 2014)
Convolution followed by
global max-pooling
Performance improves
further by including more
negative samples during
training (50 vs. 4)
Rich space of neural architectures
(Palangi et al., 2015)
(Kalchbrenner et al., 2014)
(Denil et al., 2014)
(Kim, 2014)
(Severyn and Moschitti, 2015)
(Tai et al., 2015) (Zhao et al., 2015) (Hu et al., 2014)
Rich space of neural architectures
(Kalchbrenner et al., 2014)
Input representation based on pre-trained word
embeddings
Wide vs. narrow convolution (presence/absence of
zero padding)
Dynamic k max-pooling, where k is a function of the
number of convolutional layers and sentence length
Evaluated on non-IR tasks (sentiment prediction and
question-type classification)
Dynamic Convolutional
Neural Network
Rich space of neural architectures
Uses the DCNN model to learn document embeddings
First DCNN on word embeddings to get sentence embedding
Then separate DCNN on sentences to get document embedding
Evaluated on classification tasks
(Denil et al., 2014)
Convolutional model for text classification and sentiment prediction
Compares strategies for input word representation using word embeddings
Rich space of neural architectures
(Kim, 2014)
1. Embeddings are randomly initialized
and learnt during E2E training
2. Pre-trained embeddings not
updated during E2E training
3. Pre-trained embeddings further
tuned during E2E training
4. Concatenation of two different vectors, both initialized with pre-trained embeddings but only one updated during E2E training
A different strategy wins on different datasets; the size of the E2E training data must be an important factor
Stacked alternate layers of convolution and max-pooling
Rich space of neural architectures
(Hu et al., 2014)
Depth of network is fixed
Unlike recursive models, the weights
are not shared between convolutions
at different depth
ARC-I
Rich space of neural architectures
Convolutional model for short text ranking
Pre-trained word embeddings for input, not tuned
during E2E training
Additional layers for Q-D embedding comparison,
could also include non embedding features
Pointwise loss function, model predicts relevance class (Severyn and Moschitti, 2015)
Evaluated on microblog retrieval, unfortunately no comparisons to DSSM / CDSSM
Rich space of neural architectures
Replace convolution-pooling w/ LSTM
(Palangi et al., 2015)
LSTM-DSSM
Rich space of neural architectures
Recently various tree-structured
RNNs / LSTMs have also been
explored in the context of text
similarity and classification tasks
(Tai et al., 2015) (Zhao et al., 2015)
Interaction
matrix
Alternative to Siamese networks
Interaction matrix X, where x_ij is obtained by comparing the ith word in the source sentence with the jth word in the target sentence
Comparison is generally lexical or based on
pre-trained embeddings
Focus on recognizing good matching patterns
instead of learning good representations
(Hu et al., 2014)
(Pang et al., 2014)
MatchPyramid
ARC-II
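A sketch of building an interaction matrix from (invented) word embeddings; a matching network such as a CNN then consumes X:

```python
import numpy as np

rng = np.random.default_rng(5)
words = ["seattle", "weather", "forecast", "rain", "in", "today"]
emb = {w: rng.normal(size=16) for w in words}    # hypothetical embeddings

def unit(v):
    return v / np.linalg.norm(v)

query = ["seattle", "weather"]
doc = ["rain", "forecast", "in", "seattle", "today"]
X = np.array([[unit(emb[q]) @ unit(emb[d]) for d in doc] for q in query])
print(X.shape)    # (2, 5): x_ij = cosine(query word i, doc word j)
```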
Ad-hoc retrieval using
local representation
Deep Relevance Matching Model (DRMM) for
ad-hoc retrieval
Argues exact matching more important than
representation learning for ad-hoc retrieval
DNN on top of histogram based features
Gating network based on term IDF score
Word embeddings are also used, but their impact is not clear
(Guo et al., 2016)
Ad-hoc retrieval using
local and distributed
representation
Argues both “lexical” and “semantic” matching is
important for document ranking
Duet model is a linear combination of two DNNs using
local and distributed representations of query/document
as inputs, and jointly trained on labelled data
Local model operates on lexical interaction matrix
Distributed model operates on n-graph representation of
query and document text
(Mitra et al., 2017)
Using our earlier taxonomy of models…
Local model: query text → generate query term vector; doc text → generate doc term vector; the two term vectors combine into an interaction matrix → DNN for matching.
Distributed model: query text → generate query embedding; doc text → generate doc embedding; Hadamard product of the two embeddings → DNN for matching.
(Mitra et al., 2017)
Ad-hoc retrieval using
local and distributed
representation
Distributed model not likely to have good representation for rare
intents (e.g., a television model number ‘SC32MN17’)
For the query “what channel are the seahawks on today”
documents may contain the term “ESPN” instead – distributed
model likely to make the “channel” → “ESPN” connection
Distributed model more effective for popular intents, fall back on
local model for rare ones
Local and distributed models likely to make different mistakes
and therefore more effective when combined
The library vs. librarian dilemma
Is there a hidden assumption in distributed models that they know about everything in the universe?
The distributed model knows more about “Barack Obama” than “Bhaskar Mitra” and will perform better on the former. The local model doesn’t understand either, but therefore might perform better for the latter query compared to the distributed model.
Is your model the library (knows about the world) or the librarian (knows how to find information in a library without much prior domain knowledge)?
Big vs. small
data regimes
Big data seems to be more crucial for models that focus on
good representation learning for text
Partial supervision strategies (e.g., unsupervised pre-training
of word embeddings) can be effective but may be leaving the
bigger gains on the table
Learning to train on unlabeled data
may be key to making progress on
neural ad-hoc retrieval
Which IR models are similar?
Clustering based on query level
retrieval performance.
Implementing Duet using the
Cognitive Toolkit (CNTK)
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb
Other
Applications
Chapter 6
Query
Recommendations
Hierarchical sequence-to-sequence model for term-
by-term query generation
As in ad-hoc ranking, the DNN feature alone performs poorly, but adding it shows significant improvements over a model with lexical contextual features
(Sordoni et al., 2015)
Query
auto-completion
Given a (rare) query prefix retrieve relevant
suffixes from a fixed set
CDSSM model trained on query prefix-suffix pairs
can be used for suffix ranking
Training on prefix-suffix pairs leads to a more “typical” embedding space
(Mitra and Craswell, 2015)
Analogies for
session modelling CDSSM vectors when trained on prefix-suffix pairs or
session query pairs are amenable to analogical tasks
(Mitra, 2015)
Analogies for
session modelling
This property was shown to be useful for modelling
session context
"london" → "things to do in london“
is similar to
"new york" → "new york tourist attractions“
Possible scope for modelling sessions as paths in the
embedding space?
(Mitra, 2015)
Multi-task
learning
DSSM model trained under a multi-task setting
e.g., query classification and ranking
Lower layer weights are shared, top layer specific
to the task
Leverages cross-task data as well as reduces over-fitting to a particular task; the learnt embedding may be useful for related tasks not directly trained for
(Liu et al., 2015)
Neural approach for
Knowledge-based IR
Incorporating distributed
representations of KB entities in
deep neural retrieval models
Advocates that KBs can either
support learning of latent
representations of text or serve
as a semantic translator between
their latent representations
(Nguyen et al., 2016)
Conversational
response retrieval
Ranking responses for conversational systems
Interesting challenges in modelling utterance
level discourse structure and context
Can multi-turn conversational tasks become a
key playground for neural retrieval models?
(Yan et al., 2016)
(Zhou et al., 2016)
Proactive
retrieval
Given current task context generate a proactive query
and retrieve recommendations
In this particular paper authors focused on
recommendations during a document authoring task
(Luukkonen et al., 2016)
Multimodal
retrieval
This tutorial is focused on text but
neural representation learning is also
leading towards breakthroughs in
multimodal retrieval
(Ma et al., 2015)
Useful Resources
Pyndri for Python: https://github.com/cvangysel/pyndri
Luandri for LUA / Torch: https://github.com/bmitra-msft/Luandri
Public word embedding datasets:

Model     Description
Word2vec  ~3M word vocabulary, trained on the Google News dataset (Link)
GloVe     Multiple datasets with ~400K–2M word vocabularies (Link)
DESM      IN + OUT embeddings for ~3M words, trained on Bing query logs (Link)
NTLM      Multiple datasets with ~50K–3M word vocabularies (Link)
Next gen neural IR models (using
reinforcement learning or
adversarial learning) may need to
perform retrieval as part of the
model training / evaluation step
Bhaskar Mitra
@UnderdogGeek
bmitra@microsoft.com
Nick Craswell
@nick_craswell
nickcr@microsoft.com
send us your questions and feedback
during or after the tutorial
Practical cases, Applied linguistics course (MUI)
 
download
downloaddownload
download
 
download
downloaddownload
download
 
Deep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text ClassificationDeep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text Classification
 

More from Bhaskar Mitra

So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...Bhaskar Mitra
 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Bhaskar Mitra
 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcomeBhaskar Mitra
 
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Bhaskar Mitra
 
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)Bhaskar Mitra
 
Recurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovRecurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovBhaskar Mitra
 

More from Bhaskar Mitra (19)

So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...
 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and Recommendation
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progress
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcome
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
 
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)
Query Expansion with Locally-Trained Word Embeddings (Neu-IR 2016)
 
Recurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovRecurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas Mikolov
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Neural Text Embeddings for Information Retrieval (WSDM 2017)

  • 1. WSDM 2017 Tutorial Neural Text Embeddings for IR Bhaskar Mitra and Nick Craswell Download slides from: http://bit.ly/NeuIRTutorial-WSDM2017
  • 2. Check out the full tutorial: https://arxiv.org/abs/1705.01509
  • 3. SIGIR papers with title words: Neural, Embedding, Convolution, Recurrent, LSTM. [Chart: neural network papers @ SIGIR: 1% (SIGIR 2014, accepted), 4% (SIGIR 2015, accepted), 8% (SIGIR 2016, accepted), 11% (SIGIR 2017, submitted), 20% (SIGIR 2018, optimistic?).] Deep learning: amazingly successful on many hard applied problems, dominating multiple fields in turn (speech ~2011, vision ~2013, NLP ~2015, IR ~2017?). But also beware the hype. Christopher Manning. Understanding Human Language: Can NLP and Deep Learning Help? Keynote SIGIR 2016 (slides 71, 72)
  • 4. Neural methods for information retrieval This tutorial mainly focuses on: • Ranked retrieval of short and long texts, given a text query • Shallow and deep neural networks • Representation learning For broader topics (multimedia, knowledge) see: • Craswell, Croft, Guo, Mitra, and de Rijke. Neu-IR: Workshop on Neural Information Retrieval. SIGIR 2016 workshop • Hang Li and Zhengdong Lu. Deep Learning for Information Retrieval. SIGIR 2016 tutorial
  • 5. Today’s agenda 1. IR Fundamentals 2. Word embeddings 3. Word embeddings for IR 4. Deep neural nets 5. Deep neural nets for IR 6. Other applications We cover key ideas, rather than exhaustively describing all NN IR ranking papers. For a more complete overview see: • Zhang et al. Neural information retrieval: A literature review. 2016 • Onal et al. Getting started with neural models for semantic matching in web search. 2016 slides are available here for download
  • 8. Information retrieval (IR) terminology This tutorial: Using neural networks in the retrieval system to improve relevance. For text queries and documents. Information need query results ranking (document list) retrieval system indexes a document corpus Relevance (documents satisfy information need e.g. useful)
  • 9. IR applications Document ranking Factoid question answering Query Keywords Natural language question Document Web page, news article A fact and supporting passage TREC experiments TREC ad hoc TREC question answering Evaluation metric Average precision, NDCG Mean reciprocal rank Research solution Modern TREC rankers BM25, query expansion, learning to rank, links, clicks+likes (if available) IBM@TREC-QA Answer type detection, passage retrieval, relation retrieval, answer processing and ranking In products • Document rankers at: Google, Bing, Baidu, Yandex, … • Watson@Jeopardy • Voice search This NN tutorial Long text ranking Short text ranking
  • 10. Challenges in (neural) IR [slide 1/3] • Vocabulary mismatch Q: How many people live in Sydney?  Sydney’s population is 4.9 million [relevant, but missing ‘people’ and ‘live’]  Hundreds of people queueing for live music in Sydney [irrelevant, and matching ‘people’ and ‘live’] • Need to interpret words based on context (e.g., temporal) Today Recent In older (1990s) TREC data query: “uk prime minister” Vocab mismatch: • Worse for short texts • Still an issue for long texts
  • 11. Challenges in (neural) IR [slide 2/3] • Q and D vary in length • Models must handle variable length input • Relevant docs have irrelevant sections https://en.wikipedia.org/wiki/File:Size_distribution_among_Featured_Articles.png
  • 12. Challenges in (neural) IR [slide 3/3] • Need to learn Q-D relationship that generalizes to the tail • Unseen Q • Unseen D • Unseen information needs • Unseen vocabulary • Efficient retrieval over many documents • KD-Tree, LSH, inverted files, … Figure from: Goel, Broder, Gabrilovich, and Pang. Anatomy of the long tail: ordinary people with extraordinary tastes. WWW Conference 2010
  • 14. One-hot representation (local). Dim = |V|; sim(banana, mango) = 0. [Figure: two |V|-dimensional one-hot vectors for 'banana' and 'mango', each with a single 1 in a different position.] Notes: 1) a popular sim() is cosine; 2) words/tokens come from some tokenization and transformation
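To make the zero-similarity problem concrete, here is a minimal sketch (the five-word vocabulary is a toy assumption):

```python
import numpy as np

vocab = ["banana", "mango", "dog", "tree", "yellow"]   # toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot("banana"), one_hot("mango")))  # 0.0: no graded similarity
```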
  • 15. Hinton, Geoffrey E. Distributed representations. Technical Report CMU-CS-84-157, 1984
  • 16. Context-based distributed representation. "You shall know a word by the company it keeps." Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, p. 11. Blackwell, Oxford. sim(banana, mango) > 0: the two words appear in the same documents and near the same words. [Figure: distributed vectors for 'banana' and 'mango' with overlapping non-zero dimensions.] Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 2010
  • 17. [Figure: four context choices for representing 'banana'. Word-Document: the documents it occurs in (Doc2, Doc7, Doc9); Word-Word: neighbouring words (yellow, grows, on, tree, africa); Word-WordDist: neighbouring words with relative positions ((yellow, -1), (grows, +1), (on, +2), (tree, +3), (africa, +5)); word hash: letter trigraphs (#ba, ban, ..., na#), which are not context-based.] "You shall know a word by the company it keeps." Distributional semantics
  • 18. Distributional methods use distributed representations Distributed and distributional • Distributed representation: Vector represents a concept as a pattern, rather than 1-hot • Distributional semantics: Linguistic items with similar distributions (e.g. context words) have similar meanings http://www.marekrei.com/blog/26-things-i-learned-in-the-deep-learning-summer-school/ “You shall know a word by the company it keeps ”
  • 19. Vector Space Models. For a given task: choose a matrix and choose the $s_{ij}$ weighting, which could be binary or raw counts. Example weightings: Positive Pointwise Mutual Information (PPMI) for a Word-Word matrix; TF-IDF for a Word-Document matrix. V: vocabulary, C: set of contexts, S: sparse $|V| \times |C|$ matrix with entries $s_{ij}$. Turney and Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 2010
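A minimal sketch of the PPMI weighting described above; the 3x3 co-occurrence counts are toy assumptions:

```python
import numpy as np

# Toy word-word co-occurrence counts (3 words x 3 contexts).
S = np.array([[0., 8., 2.],
              [8., 0., 1.],
              [2., 1., 0.]])

total = S.sum()
p_ij = S / total                              # joint probabilities
p_i = S.sum(axis=1, keepdims=True) / total    # word marginals
p_j = S.sum(axis=0, keepdims=True) / total    # context marginals

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
ppmi = np.maximum(pmi, 0.0)                   # clip negative associations to zero
ppmi[np.isnan(ppmi)] = 0.0                    # 0/0 cells contribute nothing
print(ppmi)
```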
  • 20. Distributed word representation overview • Next we cover lower-dimensional dense representations, including word2vec (for questions on its advantages, see*) • But first we consider the effect of context choice, in the data itself. For each context type (counts matrix / learning from the counts matrix / learning from individual instances): Word-Document: vector space models / LSA / Paragraph2Vec PV-DBOW; Word-Word: vector space models / GloVe / word2vec; Word-WordDist: vector space models (counts only). * Baroni, Dinu and Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014; Levy, Goldberg and Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL 3:211-225
  • 21. Let’s consider the following example… We have four (tiny) documents, Document 1 : “seattle seahawks jerseys” Document 2 : “seattle seahawks highlights” Document 3 : “denver broncos jerseys” Document 4 : “denver broncos highlights”
  • 22. Using Word-Document context seattle Document 1 Document 3 Document 2 Document 4 seahawks denver broncos similar similar This is Topical or Syntagmatic similarity.
  • 23. Using Word-WordDist context seattle (seattle, -1) (denver, -1) (seahawks, +1) (broncos, +1) (jerseys, + 1) (jerseys, + 2) (highlights, +1) (highlights, +2) seahawks denver broncos similar similar This is Typical or Paradigmatic Similarity.
  • 24. Let’s consider another example… “seattle map” “seattle weather” “seahawks jerseys” “seahawks highlights” “seattle seahawks wilson” “seattle seahawks sherman” “seattle seahawks browner” “seattle seahawks lfedi” “denver map” “denver weather” “broncos jerseys” “broncos highlights” “denver broncos lynch” “denver broncos sanchez” “denver broncos miller” “denver broncos marshall” With more (tiny) documents…
  • 25. Using Word-Word context seattle seattle denver seahawks broncos jerseys highlights wilson sherman seahawks denver broncos browner lfedi lynch sanchez miller marshall map weather similar similar 1. Word-Word is less sparse than Word-Document (Yan et al., 2013) 2. A mix of topical and typical similarity (function of window size)
  • 26. Paradigmatic vs Syntagmatic Do we choose one axis or the other, or both, or something else? Can we learn a general-purpose word representation that solves all our IR problems? Or do we need several? seattle seahawks richard sherman jersey denver broncos trevor siemian jersey paradigmatic (or typical) axis syntagmatic (or topical) axis
  • 27. Embeddings (distributed) Dense. Dim = 200 (for example) banana mango dog
  • 28. Latent Semantic Analysis. V: vocabulary, D: set of documents, X: sparse $|V| \times |D|$ matrix with $x_{ij}$ = TF-IDF weight of term $t_i$ in document $d_j$. Singular value decomposition of X: $X = U \Sigma V^\top$. Deerwester, Dumais, Furnas, Landauer and Harshman. Indexing by latent semantic analysis. JASIS 41, no. 6, 1990
  • 29. Latent Semantic Analysis. $\sigma_1, \ldots, \sigma_l$: singular values; $u_1, \ldots, u_l$: left singular vectors; $v_1, \ldots, v_l$: right singular vectors. The k largest singular values, with the corresponding singular vectors from U and V, give the rank-k approximation of X: $X_k = U_k \Sigma_k V_k^\top$. Embedding of the i-th term = $\Sigma_k \hat{t}_i$. Source: en.wikipedia.org/wiki/Latent_semantic_analysis. Dumais. Latent semantic indexing (LSI): TREC-3 report. NIST Special Publication SP (1995): 219-219
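A minimal LSA sketch under toy assumptions (the four-document corpus is invented; scikit-learn and scipy are used for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse.linalg import svds

docs = ["seattle seahawks jerseys", "seattle seahawks highlights",
        "denver broncos jerseys", "denver broncos highlights"]

X = TfidfVectorizer().fit_transform(docs).T     # |V| x |D| term-document matrix
U, sigma, Vt = svds(X.asfptype(), k=2)          # rank-2 truncated SVD

term_embeddings = U * sigma                     # embedding of the i-th term
print(term_embeddings.shape)                    # (|V|, 2)
```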
  • 30. Word2vec Goal: simple (shallow) neural model learning from billion words scale corpus Predict middle word from neighbors within a fixed size context window Two different architectures: 1. Skip-gram 2. CBOW (Mikolov et al., 2013)
  • 31. Skip-gram. Predict neighbor $w_{t+j}$ given word $w_t$. Maximizes the following average log probability: $\mathcal{L}_{\text{skip-gram}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} | w_t)$, where $p(w_{t+j} | w_t) = \frac{\exp\left(W_{OUT}(w_{t+j})^\top W_{IN}(w_t)\right)}{\sum_{v=1}^{|V|} \exp\left(W_{OUT}(w_v)^\top W_{IN}(w_t)\right)}$. The full softmax is computationally impractical; word2vec uses hierarchical softmax or negative sampling instead. (Mikolov et al., 2013)
  • 32. Continuous Bag-of-Words (CBOW). Predict a word given the bag of its neighbors, modifying the skip-gram loss function: $\mathcal{L}_{CBOW} = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t | w_{t^*})$, where $w_{t^*} = \sum_{-c \le j \le c,\, j \ne 0} w_{t+j}$. (Mikolov et al., 2013)
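Both architectures are available off the shelf; a hedged sketch using gensim (version 4 or later; the tiny corpus is a toy assumption):

```python
from gensim.models import Word2Vec

sentences = [["seattle", "seahawks", "jerseys"],
             ["seattle", "seahawks", "highlights"],
             ["denver", "broncos", "jerseys"],
             ["denver", "broncos", "highlights"]]

# sg=1 selects skip-gram (sg=0 gives CBOW); negative=5 replaces the full
# softmax with negative sampling, as described on the slide.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)
print(model.wv.most_similar("seattle", topn=2))
```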
  • 33.
  • 34.
  • 35. Word analogies with word2vec [king] – [man] + [woman] ≈ [queen] (Mikolov et al., 2013)
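A sketch of the analogy computation; "embeddings.bin" is a placeholder for any word2vec-format embedding file:

```python
from gensim.models import KeyedVectors

# "embeddings.bin" is a placeholder for any word2vec-format embedding file.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# [king] - [man] + [woman] ~= [queen]
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```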
  • 36. Word analogies can work in the underlying data too. [Figure: in the Word-Word context space, [seahawks] - [seattle] + [denver] lands near [broncos].] Sparse vectors can work well for an analogy task: Levy, Goldberg and Ramat-Gan. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL 2014
  • 37. A Matrix Interpretation of word2vec. Skip-gram looks like this: $\mathcal{L}_{\text{skip-gram}} = -\sum_{(A,B)} \log p(B|A)$. If we aggregate over all training samples: $\mathcal{L} = -\sum_{i=1}^{|V|} \sum_{j=1}^{|V|} X_{i,j} \log p(w_i|w_j) = -\sum_{i=1}^{|V|} X_i \sum_{j=1}^{|V|} \frac{X_{i,j}}{X_i} \log p(w_i|w_j) = \sum_{i=1}^{|V|} X_i\, H\big(P(w_i|w_j),\, p(w_i|w_j)\big)$, a cross-entropy between the actual co-occurrence probability $P(w_i|w_j)$ and the co-occurrence probability $p(w_i|w_j)$ predicted by the model. (Pennington et al., 2014)
  • 38. GloVe. Variety of window sizes and weightings; trained with AdaGrad. $\mathcal{L}_{GloVe} = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} f(X_{i,j}) \big(\log X_{i,j} - w_i^\top w_j\big)^2$: a weighting function $f$ times the squared error between the actual co-occurrence statistic $\log X_{i,j}$ and the value $w_i^\top w_j$ predicted by the model. (Pennington et al., 2014)
  • 39. Discussion of word representations • Representations: Weighted counts, SVD, neural embedding • Use similar data (W-D, W-W), capture similar semantics • For example, analogies using counts • Think sparse, act dense • Stochastic gradient descent allows us to scale to larger data • On individual instances (w2v) or count matrix data (GloVe) • Scalability could be the key advantage for neural word embeddings • Choice of data affects semantics • Paradigmatic vs Syntagmatic • Which is best for your application?
  • 41. Traditional IR feature design • Inverse document frequency • Term frequency • Adjust the TF model for document length. [Figure: Harter's (1975) 2-Poisson model of TF, showing probability mass vs. term frequency of t in D with separate curves for 'D is about t' and 'D not about t'; Robertson and Sparck-Jones' (1977) naive Bayes model of IDF, with contingency counts r, R, n, N.] Rank D by: $\sum_{t \in Q} TF(t, D) \cdot IDF(t)$. Robertson and Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, no. 4 (2009)
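As a concrete rendering of the TF(t, D) x IDF(t) ranking rule above, a minimal BM25-style scoring sketch (the k1 and b values are conventional defaults, not taken from the tutorial):

```python
import math

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    """Sum over query terms of IDF(t) times a saturated, length-normalised
    TF(t, D). df: document frequencies, N: corpus size, avgdl: avg doc length."""
    score, dl = 0.0, len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```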
  • 42. How to incorporate embeddings A. Extend traditional IR models • Term weighting, language model smoothing, translation of vocab B. Expand query using embeddings (followed by non-neural IR) • Add words similar to the query C. IR models that work in the embedding space • Centroid distance, word mover’s distance
  • 43. Term weighting using word embeddings. $y = r/R$ (term recall) • the fraction of positive docs containing t • the $r$ and $R$ were missing in RSJ. $x = t - q$ • $t$ is the embedding of term t • $q$ is the centroid of all query term embeddings • Weight TF-IDF using $y$. Zheng and Callan. Learning to reweight terms with distributed representations. SIGIR 2015
  • 44. Traditional IR model: Query likelihood • The language modeling approach to IR is quite extensible: $p(d|q) \propto p(q|d)\,p(d)$ • $p(d)$ can be assumed uniform across docs • $p(q|d) = \prod_{w \in q} p(w|d)$ depends on modeling the document • Frequent words in $d$ are more likely (term frequency) • Smoothing according to the corpus (plays the role of IDF) • Various ways of dealing with document length normalization. Ponte and Croft. A language modeling approach to information retrieval. SIGIR 1998. Zhai and Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. SIGIR 2001
  • 45. Generalized Language Model Ganguly, Roy, Mitra, and Jones. Word embedding based generalized language model for information retrieval. SIGIR 2015 Compares query term with every document term
  • 46. Neural Language Translation Model Zuccon, Koopman, Bruza, and Azzopardi. "Integrating and evaluating neural word embeddings in information retrieval." Australasian Document Computing Symposium 2015 Translation probability from document term u to query term w Considers all query-document term pairs Based on: Berger and Lafferty. "Information retrieval as statistical translation." SIGIR 1999
  • 47. B. Query expansion. Both passages have the same number of gold query matches, yet the non-query green matches can be good evidence of aboutness. We need methods that consider non-query terms; traditionally, automatic query expansion. Example from Mitra et al., 2016. Query: Albuquerque
  • 48. Query expansion using w2v Identify expansion terms using w2v cosine similarity 3 different strategies: pre-retrieval, post-retrieval, and pre-retrieval incremental Beats no expansion, but does not beat non-neural expansion Roy, Paul, Mitra and Garain. Using Word embeddings for automatic query expansion. Neu-IR Workshop 2016
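A pre-retrieval expansion sketch in this spirit; the function name is hypothetical, and "embeddings.bin" is a placeholder for any word2vec-format model file:

```python
from gensim.models import KeyedVectors

# "embeddings.bin" is a placeholder for any word2vec-format model file.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def expand_query(query_terms, wv, k=3):
    """Pre-retrieval expansion: add the k nearest embedding-space
    neighbours of each query term (hypothetical helper)."""
    expanded = list(query_terms)
    for t in query_terms:
        if t in wv:
            expanded += [w for w, _ in wv.most_similar(t, topn=k)]
    return expanded

# The expanded terms are then fed to a conventional, non-neural ranker.
print(expand_query(["albuquerque"], wv))
```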
  • 49. Query expansion using w2v • Related paper. Distance 𝛿 has a sigmoid transform. First model: • Their second model uses the first round of retrieval • Can beat non-neural expansion Zamani and Croft. Embedding-based query language models. International Conference on the Theory of Information Retrieval 2016 Lavrenko and Croft. Relevance based language models. SIGIR 2001
  • 50. Optimizing the query vector Zamani and Croft. Estimating embedding vectors for queries. International Conference on the Theory of Information Retrieval 2016
  • 51. Global vs. local embedding spaces Train w2v on documents from first round of retrieval Fine-grained word sense disambiguation A large number of embedding spaces can be cached in practice (Diaz et al., 2016)
  • 52. C. IR models in the embedding space • Q: Bag of word vectors • D: Bag of word vectors • Considerations: • Deal with variable length of Q and D • Match all pairs? Do some alignment of Q and D?
  • 53. Comparing short texts • Weighted semantic network • Related to word alignment • Each word in the longer text is connected to its most similar word in the other text • BM25-like edge weighting • Generates features for supervised learning of short text similarity. Kenter and de Rijke. Short text similarity with word embeddings. CIKM 2015
  • 54. Word Mover's Distance. Sparse normalized TF vectors: $\|d\|_1 = \|d'\|_1 = 1$. Sparse flow matrix T: $\sum_j T_{ij} = d_i$ and $\sum_i T_{ij} = d'_j$. Minimize the transport cost $\sum_{i,j} T_{ij}\, c(i,j)$, where $c(i,j)$ is the distance between words i and j in the word2vec embedding space. Kusner, Sun, Kolkin and Weinberger. From Word Embeddings To Document Distances. ICML 2015. See also: Huang, Guo, Kusner, Sun, Weinberger and Sha. Supervised Word Mover's Distance. NIPS 2016
  • 55. Bounds on word mover’s distance WCD <= RWMD <= WMD WCD: Word centroid distance RWMD: Relaxed WMD Prefetch and prune algorithm: • Sort by WCD • Compute WMD on top-k • Continue, using RWMD to prune 95% of WMD Kusner, Sun, Kolkin and Weinberger. From Word Embeddings To Document Distances. ICML 2015
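A sketch of computing WMD with gensim's built-in wmdistance (which needs the pyemd or POT package, depending on the gensim version); the model file name is a placeholder, and the sentence pair is the example from Kusner et al.:

```python
from gensim.models import KeyedVectors

# "embeddings.bin" is a placeholder for any word2vec-format model file.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

d1 = "obama speaks to the media in illinois".split()
d2 = "the president greets the press in chicago".split()
print(wv.wmdistance(d1, d2))  # smaller distance = more similar documents
```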
  • 56. Centroid and related techniques • Goal: bag of vectors -> single fixed-size vector • Centroid, weighted centroid (or sum [Vulic and Moens, 2015]) • Aggregate using a Fisher Kernel • Clinchant and Perronnin. Aggregating continuous word embeddings for information retrieval. (ACL) Workshop on Continuous Vector Space Models and their Compositionality, 2013 • Using the Fisher Kernel from Jaakkola and Haussler (1999)
  • 57. What if I told you everyone using w2v is throwing half the model away? Two sets of embeddings are trained (W_IN and W_OUT), but W_OUT is generally discarded. The IN-OUT dot product captures the log probability of co-occurrence. (Mitra et al., 2016)
  • 58. Notions of Similarity. IN-OUT captures a more topical notion of similarity than IN-IN and OUT-OUT. The effect is exaggerated when the embeddings are trained on short text (e.g., queries). (Mitra et al., 2016)
  • 59. Dual Embedding Space Model Compare query terms with every document term Equivalent to taking centroid of all term vectors Combine with traditional retrieval model (Mitra et al., 2016) Query and document representation: Centroid of all words, including repetition
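A sketch of the DESM scoring idea, assuming `w_in` and `w_out` are dicts mapping words to unit-normalised IN and OUT vectors (e.g., loaded from the published DESM embeddings):

```python
import numpy as np

def desm_score(query_terms, doc_terms, w_in, w_out):
    """Each query term's IN embedding is compared against the centroid of
    the document terms' OUT embeddings (repetitions included)."""
    doc_vecs = [w_out[t] for t in doc_terms if t in w_out]
    if not doc_vecs:
        return 0.0
    centroid = np.mean(doc_vecs, axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = [w_in[t] @ centroid for t in query_terms if t in w_in]
    return float(np.mean(sims)) if sims else 0.0
```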
  • 60. Telescoping • DESM also works well when “telescoping” i.e. reranking a small set • For example, DESM can confuse Oxford and Cambridge • Bing rarely makes the Oxford-Cambridge mistake • The upstream ranker (Bing) saves DESM from making that mistake • If DESM sees more documents, it may make the mistake Matveeva, Burges, Burkard, Laucius, and Wong. High accuracy retrieval with multiple nested ranker. SIGIR 2006
  • 61. Because Cambridge is not "an African even-toed ungulate mammal" https://github.com/bmitra-msft/Demos/blob/master/notebooks/DESM.ipynb
  • 62. Paragraph2vec. PV-DM: a mix of LSA and w2v style training data, less sparse than training just on the paragraph ID. PV-DBOW: effectively modelling the same relationship as LSA (factorizing the word-document matrix). Le and Mikolov. Distributed Representations of Sentences and Documents. ICML 2014
  • 63. Models of similar flavour Grbovic, Djuric, Radosavljevic, Silvestri and Bhamidipati. Context-and content-aware embeddings for query rewriting in sponsored search. SIGIR 2015. Sun, Guo, Lan, Xu and Cheng Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations ACL 2015
  • 64. Adapting PV-DBOW for IR Negative sampling using IDF instead of corpus frequency L2 regularization constraint on the norm to avoid over-fitting on short documents Reduce sparseness by predicting context words (similar to PV-DM?) Ai, Yang, Guo and Croft. Analysis of the Paragraph Vector Model for Information Retrieval. ICTIR 2016
  • 65. Discussion of embeddings in IR • The right mix of lexical and semantic matching is important • In non-telescoping settings, need to combine with lexical IR • Many IR models do some sort of vector comparison of Q-D words • Open question: which notion of similarity is best for each IR model (typical vs. topical vs. some mix) • These methods seem promising if: high-quality, domain-appropriate embeddings are available, and no large-scale supervised IR data is available • If large-scale supervised IR data is available… (after the break)
  • 67. A primer on neural networks Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ) Popular choices for σ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backward pass Expected output loss Tanh ReLU
  • 68. Visual motivation for hidden units. Consider the following "toy" challenge for classifying tech queries. Vocab: {surface, kerberos, book, library}. Labels: "surface book", "kerberos library" ✓; "kerberos surface", "library book" ✗. Can't separate using a linear model! Or more succinctly:

    surface kerberos book library | label
       1        0      1     0    |   ✓
       1        1      0     0    |   ✗
       0        1      0     1    |   ✓
       0        0      1     1    |   ✗

But let's consider a tiny neural network with one hidden layer…
  • 69. Visual motivation for hidden units. The same queries, with the hidden-layer activations added:

    surface kerberos book library | H1 H2 | label
       1        0      1     0    |  1  0 |   ✓
       1        1      0     0    |  0  0 |   ✗
       0        1      0     1    |  0  1 |   ✓
       0        0      1     1    |  0  0 |   ✗

Now the classes can be separated using a linear model over H1 and H2!
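The transcript does not preserve the slide's exact weight values, but the hand-picked weights below reproduce the hidden-layer table above (a minimal sketch):

```python
import numpy as np

# Input order: [surface, kerberos, book, library].
W = np.array([[0.5, -1.0, 0.5, -1.0],    # H1: fires for "surface book"
              [-1.0, 0.5, -1.0, 0.5]])   # H2: fires for "kerberos library"

queries = {"surface book":     [1, 0, 1, 0],   # ✓
           "kerberos surface": [1, 1, 0, 0],   # ✗
           "kerberos library": [0, 1, 0, 1],   # ✓
           "library book":     [0, 0, 1, 1]}   # ✗

for q, x in queries.items():
    h = np.maximum(W @ np.array(x), 0.0)       # ReLU hidden layer
    print(q, h, "->", "✓" if h.sum() > 0 else "✗")
```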
  • 70. Why adding depth helps. Deeper networks can split the input space into many more (non-independent) linear regions than shallow networks. Montúfar, Pascanu, Cho and Bengio. On the number of linear regions of deep neural networks. NIPS 2014
  • 71. Why adding depth helps http://playground.tensorflow.org
  • 72. Neural models for text But how do you feed text to a neural network? Dogs have owners, cats have staff.
  • 73. Local representations of input text: char-level models. [Figure: 'dogs have owners, cats have staff' as a character sequence; each character is a one-hot vector, concatenated into a [chars x channels] matrix.]
  • 74. Local representations of input text: word-level models w/ bag-of-chars per word. [Figure: each word's character one-hot vectors are summed, then the word vectors are concatenated into a [words x channels] matrix.]
  • 75. Local representations of input text: word-level models w/ bag-of-trigraphs per word. [Figure: each word is padded as '#dogs#', '#have#', etc., and decomposed into letter trigraphs whose one-hot vectors are summed per word, then concatenated or summed into a [words x channels] or [1 x channels] matrix.]
  • 76. Local representations of input text: word-level models w/ pre-trained embeddings. [Figure: each word is mapped to its pre-trained embedding, then the embeddings are concatenated or summed into a [words x channels] or [1 x channels] matrix.]
  • 77. Shift-invariant neural operations. Detecting a pattern in one part of the input space is the same as detecting it in another (this also applies to sequential inputs, and to inputs with more than two dimensions). Leverage this redundancy by operating a moving window over the whole input space and then aggregating. The repeatable operation is called a kernel, filter, or cell. Different aggregation strategies lead to different architectures.
  • 78. Convolution. Move the window over the input space, each time applying the same cell over the window. A typical cell operation: $h = \sigma(WX + b)$. Full input: [words x in_channels]; cell input: [window x in_channels]; cell output: [1 x out_channels]; full output: [1 + (words - window)/stride x out_channels].
  • 79. Pooling. Move the window over the input space, each time applying an aggregate function over each dimension within the window: $h_j = \max_{i \in win} X_{i,j}$ (max-pooling) or $h_j = \mathrm{avg}_{i \in win} X_{i,j}$ (average-pooling). Full input: [words x channels]; cell input: [window x channels]; cell output: [1 x channels]; full output: [1 + (words - window)/stride x channels].
  • 80. Convolution w/ Global Pooling. Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed-length embedding for variable-length text. Full input: [words x in_channels]; full output: [1 x out_channels].
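A shape-level sketch of this convolution-plus-global-pooling stack in PyTorch (weights untrained; all sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

words, in_channels, out_channels, window = 12, 100, 64, 3
x = torch.randn(1, in_channels, words)            # [batch, channels, words]

conv = nn.Conv1d(in_channels, out_channels, kernel_size=window)
h = torch.relu(conv(x))                           # [1, 64, words - window + 1]
embedding = h.max(dim=2).values                   # global max-pooling
print(embedding.shape)                            # torch.Size([1, 64]): fixed length
```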
  • 81. Recurrent Neural Network. Similar to a convolution layer, but with an additional dependency on the previous hidden state. A simple cell operation is $h_i = \sigma(WX_i + Uh_{i-1} + b)$, but others like LSTMs and GRUs are more popular in practice. Full input: [words x in_channels]; cell input: [window x in_channels] + [1 x out_channels]; cell output: [1 x out_channels]; full output: [1 x out_channels].
  • 82. Recursive NN or Tree-RNN. Weights are shared across all levels of the tree. The cell can be an LSTM or as simple as $h = \sigma(WX + b)$. Full input: [words x channels]; cell input: [window x channels]; cell output: [1 x channels]; full output: [1 x channels].
  • 83. Autoencoders Unsupervised models trained to minimize reconstruction errors Information Bottleneck method (Tishby et al., 1999) The bottleneck layer captures “minimal sufficient statistics” of X and is a compressed representation of the input Image source: https://en.wikipedia.org/wiki/Autoencoder
  • 84. Computation Networks The “Lego” approach to specifying DNN architectures Library of computation nodes, each node defines logic for: 1. Forward pass: compute output given input 2. Backward pass: compute gradient of loss w.r.t. inputs, given gradient of loss w.r.t. outputs 3. Parameter gradient: compute gradient of loss w.r.t. parameters, given gradient of loss w.r.t. outputs Chain nodes to create bigger and more complex networks
  • 85. Really Deep Neural Networks (Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
  • 87. Taxonomy of Models Neural networks can be used to: 1. Generate query representation, 2. Generate document representation, and/or 3. Estimate relevance. This can involve pre-trained representations or end-to-end training or some combination of approaches. Query text Generate query representation Doc text Generate doc representation Estimate relevance Query vector Doc vector Point of query representation Point of match Point of doc representation
  • 88. You’ve already seen… Query text Query expansion using embeddings Doc text Generate doc term vector Query likelihood Query term vector Doc term vector Query text Generate query embedding Doc text Generate doc embedding Cosine similarity Query embedding Doc embedding e.g., DESM (Mitra et al., 2016) e.g., Roy et al., 2016 and Diaz et al, 2016
  • 89. What does ad-hoc IR data look like? Search queries, a document corpus, user interaction / click data, and human annotated labels. Traditional IR uses human labels as ground truth for evaluation, so ideally we want to train our ranking models on human labels. User interaction data is rich but may contain different biases compared to human annotated labels.
  • 91. search queries document corpus user interaction / click data human annotated labels In industry: Document corpus: billions? Query corpus: many billions? Labelled data w/ raw text: hundreds of thousands of queries Labelled data w/ learning-to-rank style features: same as above User interaction data: billions? What does ad-hoc IR data look like?
  • 92. In academia: Document corpus: few billion Query corpus: few million but other sources (e.g., wiki titles) can be used Labelled data w/ raw text: few hundred to few thousand queries Labelled data w/ learning-to-rank style features: tens of thousands of queries search queries document corpus human annotated labels What does ad-hoc IR data look like?
  • 93. A pessimistic summary… "Water, water, everywhere, Nor any drop to drink."
  • 94. Levels of Supervision (or the optimistic view). Unsupervised • train embeddings on an unlabeled corpus and use them in traditional IR models • e.g., GLM, NTLM, DESM, semantic hashing. Semi-supervised • DNN models using pre-trained embeddings for input text representation • e.g., DRMM, MatchPyramid. Fully supervised • DNNs w/ raw text input (one-hot word vectors or n-graph vectors) trained on labels or clicks • e.g., Duet. *Of course, ad-hoc retrieval isn't everything. A lot of recent DNN-for-IR literature focuses on other tasks such as short-text similarity or QnA, where enough training data is available; we covered most of these earlier.
  • 95. Semantic hashing. An autoencoder minimizing document reconstruction error. Vocab: 2K popular words (stopwords removed). Input/output = word counts; binary hidden units. Stacked RBMs with layer-by-layer pre-training followed by E2E tuning. Evaluation based on class labels, no relevance judgments. (Salakhutdinov and Hinton, 2007)
  • 96. Deep Semantic Similarity Model Siamese network trained E2E on query and document title pairs Relevance is estimated by cosine similarity between query and document embeddings Input: character trigraph counts (bag of words assumption) Minimizes cross-entropy loss against randomly sampled negative documents (Huang et al., 2013)
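A sketch of the character-trigraph input representation (the wrapping and windowing follow the '#dogs#' example on slide 75; the helper name is hypothetical):

```python
from collections import Counter

def char_trigraphs(text):
    """Wrap each word in '#' boundary markers and count letter trigrams,
    e.g. 'dogs' -> #do, dog, ogs, gs#."""
    counts = Counter()
    for word in text.lower().split():
        padded = "#" + word + "#"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts

print(char_trigraphs("dogs have owners"))
```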
  • 97. Convolutional DSSM. Replaces the bag-of-words assumption by concatenating term vectors in sequence at the input (Shen et al., 2014). Convolution followed by global max-pooling. Performance improves further by including more negative samples during training (50 vs. 4).
  • 98. Rich space of neural architectures (Palangi et al., 2015) (Kalchbrenner et al., 2014) (Denil et al., 2014) (Kim, 2014) (Severyn and Moschitti, 2015) (Tai et al., 2015) (Zhao et al., 2015) (Hu et al., 2014)
  • 99. Rich space of neural architectures (Kalchbrenner et al., 2014) Input representation based on pre-trained word embeddings Wide vs. narrow convolution (presence/absence of zero padding) Dynamic k max-pooling, where k is a function of the number of convolutional layers and sentence length Evaluated on non-IR tasks (sentiment prediction and question-type classification) Dynamic Convolutional Neural Network
  • 100. Rich space of neural architectures Uses the DCNN model to learn document embeddings First DCNN on word embeddings to get sentence embedding Then separate DCNN on sentences to get document embedding Evaluated on classification tasks (Denil et al., 2014)
  • 101. Convolutional model for text classification and sentiment prediction. Compares strategies for input word representation using word embeddings (Kim, 2014): 1. Embeddings are randomly initialized and learnt during E2E training 2. Pre-trained embeddings, not updated during E2E training 3. Pre-trained embeddings, further tuned during E2E training 4. Concatenation of two different vectors, both initialized with pre-trained embeddings but only one updated during E2E training. Different strategies win on different datasets; the size of the E2E training data is likely an important factor.
  • 102. ARC-I. Stacked alternating layers of convolution and max-pooling. The depth of the network is fixed. Unlike recursive models, the weights are not shared between convolutions at different depths. (Hu et al., 2014)
  • 103. Rich space of neural architectures Convolutional model for short text ranking Pre-trained word embeddings for input, not tuned during E2E training Additional layers for Q-D embedding comparison, could also include non embedding features Pointwise loss function, model predicts relevance class (Severyn and Moschitti, 2015) Evaluated on microblog retrieval, unfortunately no comparisons to DSSM / CDSSM
  • 104. Rich space of neural architectures Replace convolution-pooling w/ LSTM (Palangi et al., 2015) LSTM-DSSM
  • 105. Rich space of neural architectures Recently various tree-structured RNNs / LSTMs have also been explored in the context of text similarity and classification tasks (Tai et al., 2015) (Zhao et al., 2015)
  • 106. Interaction matrix. An alternative to Siamese networks: an interaction matrix X, where $x_{i,j}$ is obtained by comparing the i-th word in the source sentence with the j-th word in the target sentence. The comparison is generally lexical or based on pre-trained embeddings. Focus on recognizing good matching patterns instead of learning good representations. (Hu et al., 2014) (Pang et al., 2014) MatchPyramid, ARC-II
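A sketch of building such an interaction matrix from pre-trained embeddings; `emb` is an assumed dict from word to vector:

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms, emb):
    """Entry (i, j) is the cosine similarity between the i-th query term and
    the j-th document term; models like MatchPyramid feed this to a CNN."""
    Q = np.stack([emb[t] for t in query_terms])
    D = np.stack([emb[t] for t in doc_terms])
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return Q @ D.T   # shape: [len(query_terms), len(doc_terms)]
```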
  • 107. Ad-hoc retrieval using local representation. Deep Relevance Matching Model (DRMM) for ad-hoc retrieval. Argues exact matching is more important than representation learning for ad-hoc retrieval. A DNN on top of histogram-based features, with a gating network based on term IDF scores. Word embeddings are also used, but their impact is not clear. (Guo et al., 2016)
  • 108. (Mitra et al., 2017) Ad-hoc retrieval using local and distributed representation Argues both “lexical” and “semantic” matching is important for document ranking Duet model is a linear combination of two DNNs using local and distributed representations of query/document as inputs, and jointly trained on labelled data Local model operates on lexical interaction matrix Distributed model operates on n-graph representation of query and document text
  • 110. Using our earlier taxonomy of models… [Figure: the local model generates query and document term vectors, builds an interaction matrix, and applies a DNN for matching; the distributed model generates query and document embeddings, combines them with a Hadamard product, and applies a DNN for matching.]
  • 111. (Mitra et al., 2017) Ad-hoc retrieval using local and distributed representation Distributed model not likely to have good representation for rare intents (e.g., a television model number ‘SC32MN17’) For the query “what channel are the seahawks on today” documents may contain the term “ESPN” instead – distributed model likely to make the “channel” → “ESPN” connection Distributed model more effective for popular intents, fall back on local model for rare ones Local and distributed models likely to make different mistakes and therefore more effective when combined
  • 112. The library vs. librarian dilemma. Is there a hidden assumption in distributed models that they know about everything in the universe? The distributed model knows more about "Barack Obama" than "Bhaskar Mitra" and will perform better on the former. The local model doesn't understand either, but might therefore perform better on the latter query compared to the distributed model. Is your model the library (knows about the world) or the librarian (knows how to find information without much prior domain knowledge)?
  • 113. Big vs. small data regimes: big data seems to be most crucial for models that focus on good representation learning for text. Partial supervision strategies (e.g., unsupervised pre-training of word embeddings) can be effective, but may leave the bigger gains on the table. Learning to train on unlabeled data may be key to making progress on neural ad-hoc retrieval. [Figure: which IR models are similar? Clustering based on query-level retrieval performance.]
  • 114. Implementing Duet using the Cognitive Toolkit (CNTK) https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb
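The linked notebook contains the full CNTK implementation. The numpy sketch below only illustrates the overall shape described on the preceding slides, a sum of a local score from the interaction matrix and a distributed score from n-graph inputs; all layer sizes, non-linearities, and parameter names here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def local_model(interaction, W1, W2):
    """Local sub-network: a DNN over the lexical interaction matrix."""
    h = relu(interaction.reshape(-1) @ W1)
    return h @ W2                      # scalar local score

def distributed_model(q_ngraph, d_ngraph, Wq, Wd, W3):
    """Distributed sub-network: embed both texts from n-graph inputs,
    combine with a Hadamard (element-wise) product, then score."""
    q_emb = relu(q_ngraph @ Wq)
    d_emb = relu(d_ngraph @ Wd)
    return relu(q_emb * d_emb) @ W3    # scalar distributed score

def duet_score(interaction, q_ngraph, d_ngraph, p):
    # Duet combines the two sub-model scores linearly,
    # and the whole model is trained jointly end to end.
    return (local_model(interaction, p['W1'], p['W2'])
            + distributed_model(q_ngraph, d_ngraph,
                                p['Wq'], p['Wd'], p['W3']))

# Toy shapes: 4 query terms, 20 doc terms, 2000 n-graphs, hidden sizes 64/128.
rng = np.random.default_rng(0)
nq, nd, v, h, e = 4, 20, 2000, 64, 128
params = {'W1': rng.normal(size=(nq * nd, h)) * 0.01,
          'W2': rng.normal(size=h) * 0.01,
          'Wq': rng.normal(size=(v, e)) * 0.01,
          'Wd': rng.normal(size=(v, e)) * 0.01,
          'W3': rng.normal(size=e) * 0.01}
score = duet_score(rng.normal(size=(nq, nd)),
                   rng.random(v), rng.random(v), params)
```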
  • 116. Query recommendations: a hierarchical sequence-to-sequence model for term-by-term query generation (Sordoni et al., 2015). As in ad-hoc ranking, the DNN feature alone performs poorly, but adding it yields significant improvements over a model with only lexical contextual features.
  • 117. Query auto-completion: given a (rare) query prefix, retrieve relevant suffixes from a fixed set. A CDSSM model trained on query prefix-suffix pairs can be used for suffix ranking. Training on prefix-suffix pairs leads to a more “typical” embedding space (Mitra and Craswell, 2015).
  • 118. Analogies for session modelling: CDSSM vectors, when trained on prefix-suffix pairs or session query pairs, are amenable to analogical tasks (Mitra, 2015).
  • 119. Analogies for session modelling: this property was shown to be useful for modelling session context. “london” → “things to do in london” is similar to “new york” → “new york tourist attractions”. Possible scope for modelling sessions as paths in the embedding space? (Mitra, 2015)
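A sketch of the standard vector-offset analogy test in such a space. Here `vec` is a hypothetical lookup from query strings to trained CDSSM vectors, and `candidates` is a hypothetical candidate query set.

```python
import numpy as np

def analogy(vec, a, a_star, b, candidates):
    """Return the candidate whose vector is closest (by cosine) to
    vec[a_star] - vec[a] + vec[b], the usual analogy offset test.

    vec: hypothetical dict mapping query strings to embedding vectors.
    """
    target = vec[a_star] - vec[a] + vec[b]
    target = target / np.linalg.norm(target)
    def cos(c):
        v = vec[c]
        return (v @ target) / np.linalg.norm(v)
    return max(candidates, key=cos)

# e.g. analogy(vec, "london", "things to do in london",
#              "new york", candidate_queries)
# should rank "new york tourist attractions" highly if the space
# has the session-transition structure described above.
```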
  • 120. Multi-task learning: a DSSM model trained in a multi-task setting, e.g., query classification and ranking (Liu et al., 2015). Lower-layer weights are shared; the top layers are specific to each task. This leverages cross-task data and reduces over-fitting to a particular task; the learnt embedding may be useful for related tasks it was not directly trained for.
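A minimal sketch of this kind of hard parameter sharing: shared lower layers produce one text embedding, and each task adds its own head on top. Layer sizes and the relu non-linearity are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def multitask_forward(x, shared, cls_head, rank_head):
    """Shared lower layers feed two task-specific top layers:
    one for query classification, one for relevance ranking."""
    h = x
    for W in shared:                 # weights reused by both tasks
        h = relu(h @ W)
    class_logits = h @ cls_head      # query-classification head
    rank_score = h @ rank_head       # relevance-ranking head
    return class_logits, rank_score

# Toy usage: 300-dim input, two shared 128-dim layers, 10 query classes.
rng = np.random.default_rng(0)
shared = [rng.normal(size=(300, 128)) * 0.05,
          rng.normal(size=(128, 128)) * 0.05]
logits, score = multitask_forward(rng.normal(size=300), shared,
                                  rng.normal(size=(128, 10)) * 0.05,
                                  rng.normal(size=128) * 0.05)
```

During training, gradients from both task losses flow into the shared layers, which is what regularizes the learnt embedding across tasks.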
  • 121. A neural approach to knowledge-based IR: incorporating distributed representations of KB entities in deep neural retrieval models (Nguyen et al., 2016). Advocates that KBs can either support the learning of latent representations of text, or serve as a semantic translator between those latent representations.
  • 122. Conversational response retrieval: ranking responses for conversational systems (Yan et al., 2016) (Zhou et al., 2016). There are interesting challenges in modelling utterance-level discourse structure and context. Can multi-turn conversational tasks become a key playground for neural retrieval models?
  • 123. Proactive retrieval: given the current task context, generate a proactive query and retrieve recommendations (Luukkonen et al., 2016). In this particular paper, the authors focus on recommendations during a document-authoring task.
  • 124. Multimodal retrieval: this tutorial is focused on text, but neural representation learning is also leading to breakthroughs in multimodal retrieval (Ma et al., 2015).
  • 125. Useful resources.
    Pyndri for Python: https://github.com/cvangysel/pyndri
    Luandri for LUA / Torch: https://github.com/bmitra-msft/Luandri
    Public word embedding datasets:
    - Word2vec: ~3M-word vocabulary, trained on the Google News dataset [Link]
    - GloVe: multiple datasets with ~400K-2M word vocabularies [Link]
    - DESM: IN + OUT embeddings for ~3M words, trained on Bing query logs [Link]
    - NTLM: multiple datasets with ~50K-3M word vocabularies [Link]
    Next-gen neural IR models (using reinforcement learning or adversarial learning) may need to perform retrieval as part of the model training / evaluation step.

Editor's Notes

  1. Local representation vs. distributed representation:
    - One dimension for “banana” vs. “banana” is a pattern
    - Brittle under noise vs. more robust to noise
    - Precise vs. near “mango”, “pineapple” (nuanced)
    - Add vocab → add dimensions vs. add vocab → generate more vectors
    - K dimensions → K items vs. K dimensions → 2^K “regions”
  2. Samuel Taylor Coleridge