We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence that all three strategies can lead to improved retrieval quality.
1. Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Group: MSAI (Microsoft AI & Friends)
Bhaskar Mitra (Microsoft & UCL) bmitra@microsoft.com @UnderdogGeek
Sebastian Hofstätter (TU Wien) s.hofstaetter@tuwien.ac.at @s_hofstaetter
Hamed Zamani (UMass., Amherst) zamani@cs.umass.edu @HamedZamani
Nick Craswell (Microsoft) nickcr@microsoft.com @nick_craswell
2. A brief history of our work
TK + Conformer + QTI + Duet + ORCAS
(Limited evaluation on public TREC 2019 test set)
Strict blind evaluation @ TREC 2020
4. Transformer → Conformer
GPU memory complexity of self-attention is quadratic, O(n²), w.r.t. input length n
E.g., to grow the input by 4×, from n = 128 to n = 512, we need 16× the memory, and documents contain 1000s of tokens
We propose a separable self-attention layer with GPU memory complexity O(n × d_key), which is linear w.r.t. input length n
Finally, we propose Conformer blocks with:
1. Separable self-attention in place of standard self-attention
2. A grouped convolution layer before self-attention to better model local context
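The memory saving comes from never materializing the n × n attention matrix. A minimal NumPy sketch of one such linearization, which normalizes queries and keys separately and then reassociates the matrix product (the exact normalization in the CK model may differ; all names here are illustrative):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def separable_attention(Q, K, V):
    """Attention with O(n * d_key) memory instead of O(n^2).

    Standard attention computes softmax(Q @ K.T) @ V, materializing an
    n x n matrix. Here Q and K are normalized independently and the
    product is reassociated as Q' @ (K'.T @ V), where K'.T @ V is only
    d_key x d_value, i.e. independent of the sequence length n.
    """
    Qn = softmax(Q, axis=-1)   # normalize over the feature axis
    Kn = softmax(K, axis=0)    # normalize over the sequence axis
    context = Kn.T @ V         # (d_key, d_value): no n x n term
    return Qn @ context        # (n, d_value)

n, d_key, d_value = 512, 64, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for d in (d_key, d_key, d_value))
out = separable_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

With n = 512 the intermediate `context` tensor is only 64 × 64, whereas standard attention would allocate a 512 × 512 matrix; the gap widens as documents grow to thousands of tokens.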
5. CK w/ QTI
To enable full retrieval using the CK model,
we incorporate query term independence
1. Replace the query encoder with a simple word-embedding lookup
2. Apply kernel-pooling to each row of the interaction matrix independently to produce a single document score w.r.t. each query term
3. Linearly combine scores across all query terms
By incorporating QTI, we can now compute all term-document scores offline at indexing time and use an inverted index for fast retrieval
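Because the final score under QTI is a sum of independent per-term scores, retrieval reduces to standard inverted-index accumulation over precomputed scores. A toy sketch (the index contents and names are hypothetical):

```python
from collections import defaultdict

# Hypothetical precomputed index: term -> {doc_id: term-document score},
# built offline by scoring every (term, document) pair at indexing time.
index = {
    "neural":  {"d1": 1.2, "d2": 0.4},
    "ranking": {"d1": 0.7, "d3": 0.9},
}

def retrieve(query_terms, index, k=10):
    # Under QTI, score(q, d) = sum over t in q of score(t, d), so we just
    # walk the posting list of each query term and accumulate.
    scores = defaultdict(float)
    for t in query_terms:
        for doc, s in index.get(t, {}).items():
            scores[doc] += s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(retrieve(["neural", "ranking"], index))
```

No neural network needs to run at query time: the learned model only contributes the scores stored in the postings.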
6. Explicit term matching (the Duet principle)
We learn a BM25-like term-counting based matching function using gradient descent:

s_lex(t, d) = idf_t · tf_{t,d} / (tf_{t,d} + α + β · |d|)

where idf_t, tf_{t,d}, and |d| denote inverse document frequency, term frequency, and document length, and α and β are the only two learnable parameters of the model

We also define a new BatchScale operation as follows:

BatchScale(x) = x / (E[x] + ε), with the mean taken over the batch

The final term-document score is a linear combination of this and the CK score:

s(t, d) = w₁ · BatchNorm(s_CK(t, d)) + w₂ · BatchNorm(s_lex(t, d))

where the BatchNorm operation is defined as:

BatchNorm(x) = (x − E[x]) / √(Var[x] + ε)
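A sketch of such a learned term-counting function. The exact parameterization (how α and β enter, and where BatchScale and BatchNorm are applied) is an assumption here, shown only to illustrate the shape of the computation:

```python
import numpy as np

def lexical_score(idf, tf, doc_len, alpha, beta):
    # BM25-like saturation in tf with a document-length penalty;
    # alpha and beta stand in for the model's two learnable parameters.
    return idf * tf / (tf + alpha + beta * doc_len)

def batch_scale(x, eps=1e-6):
    # Assumed form: rescale by the batch mean so score magnitudes are
    # comparable across batches.
    x = np.asarray(x, dtype=float)
    return x / (x.mean() + eps)

def batch_norm(x, eps=1e-6):
    # Standard batch normalization: zero mean, unit variance per batch.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Saturation, as in BM25: doubling tf less than doubles the score.
s1 = lexical_score(idf=2.0, tf=1.0, doc_len=100, alpha=0.5, beta=0.01)
s2 = lexical_score(idf=2.0, tf=2.0, doc_len=100, alpha=0.5, beta=0.01)
print(s1, s2)
```

Because the function is differentiable in α and β, it can be trained end-to-end with the rest of the CK model by gradient descent.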
7. ORCAS
Before releasing the ORCAS click log data, we ran initial experiments on the TREC 2019 test set
We noticed small improvements by using ORCAS as (i) an additional document field and (ii) additional training data
At TREC 2020, we only explored using ORCAS as an additional description field for documents
8. Key research questions and results
RQ1. Does explicit term matching improve retrieval quality?
Under the rerank setting: No (but note that the candidates are pre-selected based on Indri term matching); under the fullrank setting: Yes (similar observation by Kuzi et al. [2020])
RQ2. Does the retrieval quality improve for our model under the fullrank compared to the rerank setting?
Without explicit term matching: No; with explicit term matching: Yes (on NCG@100 w/o ORCAS and on all metrics w/ ORCAS)
RQ3. Does using ORCAS queries as an additional document description field improve retrieval quality?
Under the rerank setting: Yes (on NDCG@10 and AP); under the fullrank setting: Yes (on all metrics)
10. How did we compare to other participating groups?
Our best run (ndrm3-orc-full)
Based on the best run per group, we were among the top 5 participating groups this year
For context, 10 groups submitted at least one nnlm run this year, and all our runs were nn runs
Source: Overview of TREC 2020 by Ellen Voorhees
11. How did we compare to other participating groups?
Our best run (ndrm3-orc-full) has the highest NDCG@10 of all nn runs, and is competitive with several nnlm runs
Every run with better NDCG@10 than ndrm3-orc-full uses: (i) large-scale pretraining, and (ii) multi-stage cascaded ranking
In contrast, ndrm3-orc-full is a single-stage retrieval model trained from scratch (incl. word embeddings) in under 1.5 days using 4x Tesla P100 GPUs (no pretraining)
We also observe that ndrm3-orc-full does well as a candidate generation technique, as measured by NCG@100
We note that 2 (out of the 3) trad runs that achieve better NCG@100 incorporate phrasal matching, which may be worth considering for future work in the context of DNN models with QTI
12. The code
Designed to be easy to use for anyone new to, or experienced with, the TREC Deep Learning track
The cost of pretraining limits architecture exploration on top of BERT; NDRM3 can serve as a good baseline for new architecture/technique development and for hypothesis testing
Can serve as a baseline for both the rerank and fullrank settings
13. Get started with Deep Learning for Search
Public resources for neural IR research:
Book (http://bit.ly/fntir-neural)
Tutorial (http://bit.ly/deeplearning4search-fire2019)
Data (http://www.msmarco.org and https://bit.ly/TREC-DL-2020)
Code (https://bit.ly/TREC-DL-Quick-Start)