This document presents the Duet model for document ranking. The Duet model uses a combination of local and distributed representations of text to perform both exact and inexact matching of queries to documents. The local model operates on a term interaction matrix to model exact matches, while the distributed model projects text into an embedding space for inexact matching. Results show the Duet model, which combines these approaches, outperforms models using only local or distributed representations. The Duet model benefits from training on large datasets and can effectively handle queries containing rare terms or needing semantic matching.
The Duet model
1. Learning to Match Using Local and Distributed
Representations of Text for Web Search
Nick Craswell
Microsoft
Bellevue, USA
nickcr@microsoft.com
Fernando Diaz
Spotify*
New York, USA
diazf@acm.org
*work done while at Microsoft
Bhaskar Mitra
Microsoft, UCL
Cambridge, UK
bmitra@microsoft.com
The Duet Model:
2. The document ranking task
Given a query, rank documents according to relevance
The query text has few terms
The document representation can be
long (e.g., body text) or short (e.g., title)
query
ranked results
search engine w/ an
index of retrievable items
3. This paper is focused on ranking documents
based on their long body text
4. Many DNN models for short text ranking
(Huang et al., 2013)
(Severyn and Moschitti, 2015)
(Shen et al., 2014)
(Palangi et al., 2015)
(Hu et al., 2014)
(Tai et al., 2015)
5. But few for long document ranking…
(Guo et al., 2016)
(Salakhutdinov and Hinton, 2009)
6. Challenges in short vs. long text retrieval
Short-text
Vocabulary mismatch more serious problem
Long-text
Documents contain mixture of many topics
Matches in different parts of the document non-uniformly important
Term proximity is important
7. The “black swans” of
Information Retrieval
The term black swan originally referred to impossible events. In 1697, Dutch explorers encountered black swans for the first time in Western Australia. Since then, the term has been used to refer to surprisingly rare events.
In IR, many query terms and intents are
never observed in the training data
Exact matching is effective in making the
IR model robust to rare events
8. Desiderata of document ranking
Exact matching
Important if query term is rare / fresh
Frequency and positions of matches
good indicators of relevance
Term proximity is important
Inexact matching
Synonymy relationships
united states president ↔ Obama
Evidence for document aboutness
Documents about Australia likely to contain
related terms like Sydney and koala
Proximity and position is important
9. Different text representations for matching
Local representation
Terms are considered distinct entities
Term representation is local (one-hot vectors)
Matching is exact (term-level)
Distributed representation
Represent text as dense vectors (embeddings)
Inexact matching in the embedding space
[Figure: local (one-hot) representation vs. distributed representation]
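A minimal sketch of the contrast, using toy data (the vocabulary, embeddings, and helper names below are hypothetical, purely for illustration):

```python
import numpy as np

# Local (one-hot) view: two terms match only if they are identical.
vocab = {"united": 0, "president": 1, "obama": 2}   # hypothetical toy vocabulary

def one_hot(term):
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

exact_match = float(one_hot("president") @ one_hot("obama"))   # 0.0 — no exact match

# Distributed view: dense embeddings can place related terms close together.
emb = {"president": np.array([0.9, 0.1]),           # hypothetical 2-d embeddings
       "obama":     np.array([0.8, 0.2])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

inexact_match = cosine(emb["president"], emb["obama"])   # high — inexact (semantic) match
```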
10. A tale of two queries
“pekarovic land company”
Hard to learn good representation for
rare term pekarovic
But easy to estimate relevance based
on patterns of exact matches
Proposal: Learn a neural model to
estimate relevance from patterns of
exact matches
“what channel are the seahawks on today”
Target document likely contains ESPN
or sky sports instead of channel
An embedding model can associate
ESPN in document to channel in query
Proposal: Learn embeddings of text
and match query with document in
the embedding space
The Duet Architecture
Use a neural network to model both functions and learn their parameters jointly
11. The Duet
architecture
Linear combination of two models
trained jointly on labelled query-
document pairs
Local model operates on lexical
interaction matrix
Distributed model projects n-graph
vectors of text into an embedding
space and then estimates match
[Architecture diagram]
Local model: generate query and doc term vectors → interaction matrix → fully connected layers for matching
Distributed model: generate query and doc embeddings → Hadamard product → fully connected layers for matching
The two model scores are combined by a sum
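A minimal sketch of this combination in PyTorch-style pseudocode (the framework choice is an assumption; `local_model` and `distributed_model` stand in for the sub-networks sketched on the following slides):

```python
import torch.nn as nn

class Duet(nn.Module):
    """Sum of the local and distributed model scores, trained jointly."""
    def __init__(self, local_model: nn.Module, distributed_model: nn.Module):
        super().__init__()
        self.local = local_model              # operates on the lexical interaction matrix
        self.distributed = distributed_model  # operates on n-graph text embeddings

    def forward(self, interaction_matrix, query_ngraphs, doc_ngraphs):
        # Gradients from the summed score flow into both sub-models, so their
        # parameters are learned jointly from labelled query-document pairs.
        return self.local(interaction_matrix) + self.distributed(query_ngraphs, doc_ngraphs)
```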
13. Local model: term interaction matrix
X_{i,j} = 1 if q_i = d_j, and 0 otherwise
In relevant documents,
→Many matches, typically clustered
→Matches localized early in document
→Matches for all query terms
→In-order (phrasal) matches
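A minimal sketch of constructing the binary interaction matrix X defined above (the query and document terms are toy examples):

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms):
    """Binary interaction matrix: X[i, j] = 1 iff query term i equals doc term j."""
    X = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, q in enumerate(query_terms):
        for j, d in enumerate(doc_terms):
            if q == d:
                X[i, j] = 1.0
    return X

X = interaction_matrix(
    ["united", "states", "president"],
    "the president of the united states lives in the white house".split(),
)
```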
14. Local model: estimating relevance
← document words →
Convolve using window of size n_d × 1
Each window instance compares a query term w/
whole document
Fully connected layers aggregate evidence
across query terms - can model phrasal matches
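A minimal sketch of the local model in PyTorch (the framework and layer sizes are assumptions for illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class LocalModel(nn.Module):
    def __init__(self, n_query=10, n_doc=1000, n_filters=300):
        super().__init__()
        # Each convolution window spans the full document axis, so one window
        # instance compares one query term with the whole document.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(1, n_doc))
        self.fc = nn.Sequential(                  # aggregates evidence across query terms
            nn.Linear(n_filters * n_query, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, X):                          # X: [batch, n_query, n_doc]
        h = torch.relu(self.conv(X.unsqueeze(1)))  # [batch, n_filters, n_query, 1]
        return self.fc(h.flatten(1))               # relevance score: [batch, 1]
```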
16. Distributed model: input representation
dogs → [ d , o , g , s , #d , do , og , gs , s# , #do , dog , ogs , gs#, #dog, dogs, ogs#, #dogs, dogs# ]
(we consider only the 2K most popular n-graphs for encoding)
[Figure: n-graph encoding of the text "dogs have owners cats have staff" — each word is encoded over 2K n-graph channels and the word vectors are concatenated into a [words x channels] matrix]
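A minimal sketch of the n-graph extraction and encoding (the '#' boundary marker and the maximum n-graph length of 5 follow the 'dogs' example above; the vocabulary argument stands in for the 2K most popular n-graphs):

```python
from collections import Counter

def ngraphs(word, max_n=5):
    """All character n-grams of the '#'-padded word, for n = 1..max_n."""
    padded = f"#{word}#"
    grams = []
    for n in range(1, max_n + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    # drop the bare boundary markers, so 'dogs' yields the 18 n-graphs listed above
    return [g for g in grams if g not in ("#", "##")]

def ngraph_vector(word, vocab):
    """Count vector over the top-K n-graph vocabulary (one channel per n-graph)."""
    counts = Counter(ngraphs(word))
    return [counts.get(g, 0) for g in vocab]
```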
21. Results
Key finding: Duet performs significantly better than local and distributed
models trained individually
22. Random negatives vs. judged negatives
Key finding: training with judged bad documents as negatives is significantly better than training with random negatives
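A minimal sketch of the two ways of constructing negatives being compared (the field names and sampling scheme are assumptions, not the paper's exact training setup):

```python
import random

def make_training_examples(query, judged, all_docs, n_neg=4, use_judged_negatives=True):
    """Pair each relevant document with n_neg negatives for ranking-style training."""
    positives = [d for d, label in judged if label > 0]
    if use_judged_negatives:
        # negatives drawn from documents judged non-relevant for this query
        pool = [d for d, label in judged if label == 0]
    else:
        # negatives drawn at random from the whole collection
        pool = all_docs
    return [(query, pos, random.sample(pool, min(n_neg, len(pool)))) for pos in positives]
```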
23. Local vs. distributed model
Key finding: the local and distributed models perform better on different segments, but the combination is always better
24. Effect of training data volume
Key finding: a large quantity of training data is necessary for learning good representations, but is less impactful for training the local model
25. Term
importance
Local model
Only query terms have an impact
Earlier occurrences have bigger impact
Query: united states president
Visualizing impact of dropping terms on model score
26. Term
importance
Distributed model
Non-query terms (e.g., Obama and federal) have a positive impact on the score
Common words like ‘the’ and ‘of’ are probably good indicators of the well-formedness of the content
Query: united states president
Visualizing impact of dropping terms on model score
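A minimal sketch of this drop-a-term analysis, assuming a trained model exposes a `score(query, doc_terms)` function (the function name is a placeholder):

```python
def term_importance(model_score, query, doc_terms):
    """Impact of each document term = score drop when that term is removed."""
    base = model_score(query, doc_terms)
    impact = {}
    for i, term in enumerate(doc_terms):
        ablated = doc_terms[:i] + doc_terms[i + 1:]
        impact[(i, term)] = base - model_score(query, ablated)
    return impact   # large positive values: removing the term hurts the score most
```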
27. Types of models
If we classify models by query-level performance, there is a clear clustering of lexical (local) and semantic (distributed) models
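A minimal sketch of one way to obtain such a clustering, assuming per-query metric values (e.g., NDCG) are available for each model (the numbers below are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows: models, columns: per-query NDCG values (hypothetical data)
per_query_ndcg = np.array([
    [0.42, 0.10, 0.55, 0.31],   # e.g., a lexical (local) model
    [0.40, 0.12, 0.52, 0.30],   # another lexical model
    [0.20, 0.45, 0.33, 0.60],   # a semantic (distributed) model
])

# hierarchical clustering on correlation distance between per-query profiles
Z = linkage(per_query_ndcg, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
```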
28. Duet on other
IR tasks
Promising early results on TREC
2017 Complex Answer Retrieval
(TREC-CAR)
Duet performs significantly
better when trained on large
data (~32 million samples)
(PAPER UNDER REVIEW)
29. Summary
Both exact and inexact matching are important for IR
Deep neural networks can be used to model both types of matching
Local model more effective for queries containing rare terms
Distributed model benefits from training on large datasets
Combine local and distributed model to achieve state-of-the-art performance
Get the model:
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb