We benchmark Conformer-Kernel models under the strict blind evaluation setting of the TREC 2020 Deep Learning track. In particular, we study the impact of incorporating: (i) explicit term matching to complement matching based on learned representations (i.e., the “Duet principle”), (ii) query term independence (i.e., the “QTI assumption”) to scale the model to the full retrieval setting, and (iii) the ORCAS click data as an additional document description field. We find evidence that all three strategies can lead to improved retrieval quality.
1. Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Group: MSAI (Microsoft AI & Friends)
Bhaskar Mitra (Microsoft & UCL) bmitra@microsoft.com @UnderdogGeek
Sebastian Hofstätter (TU Wien) s.hofstaetter@tuwien.ac.at @s_hofstaetter
Hamed Zamani (UMass., Amherst) zamani@cs.umass.edu @HamedZamani
Nick Craswell (Microsoft) nickcr@microsoft.com @nick_craswell
2. A brief history of our work
TK + Conformer + QTI + Duet + ORCAS
(Limited evaluation on public TREC 2019 test set)
Strict blind evaluation @ TREC 2020
4. Transformer → Conformer
GPU memory complexity of self-attention is quadratic, O(n²), w.r.t. input length n
E.g., to grow the input by 4×, from n = 128 to n = 512, we need 16× the memory, and documents contain 1000s of tokens
We propose a separable self-attention layer with GPU memory complexity O(n × d_key), which is linear w.r.t. input length n
Finally, we propose Conformer blocks with:
1. Separable self-attention in place of standard self-attention
2. A grouped convolution layer before self-attention to better model local context
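The memory saving comes from never materializing the n × n attention matrix. A minimal NumPy sketch of one such linearization, which normalizes queries and keys separately and then reassociates the matrix product (the exact normalization in the CK model may differ; all names here are illustrative):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def separable_attention(Q, K, V):
    """Attention with O(n * d_key) memory instead of O(n^2).

    Standard attention computes softmax(Q @ K.T) @ V, materializing an
    n x n matrix. Here Q and K are normalized independently and the
    product is reassociated as Q' @ (K'.T @ V), where K'.T @ V is only
    d_key x d_value, i.e. independent of the sequence length n.
    """
    Qn = softmax(Q, axis=-1)   # normalize over the feature axis
    Kn = softmax(K, axis=0)    # normalize over the sequence axis
    context = Kn.T @ V         # (d_key, d_value): no n x n term
    return Qn @ context        # (n, d_value)

n, d_key, d_value = 512, 64, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for d in (d_key, d_key, d_value))
out = separable_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

With n = 512 the intermediate `context` tensor is only 64 × 64, whereas standard attention would allocate a 512 × 512 matrix; the gap widens as documents grow to thousands of tokens.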
5. CK w/ QTI
To enable full retrieval using the CK model,
we incorporate query term independence
1. Replace the query encoder with a simple word-embedding lookup
2. Apply kernel-pooling to each row of the interaction matrix independently to produce a single document score w.r.t. each query term
3. Linearly combine scores across all query terms
By incorporating QTI, we can now compute all term-document scores offline at indexing time and use an inverted index for fast retrieval
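Because the final score under QTI is a sum of independent per-term scores, retrieval reduces to standard inverted-index accumulation over precomputed scores. A toy sketch (the index contents and names are hypothetical):

```python
from collections import defaultdict

# Hypothetical precomputed index: term -> {doc_id: term-document score},
# built offline by scoring every (term, document) pair at indexing time.
index = {
    "neural":  {"d1": 1.2, "d2": 0.4},
    "ranking": {"d1": 0.7, "d3": 0.9},
}

def retrieve(query_terms, index, k=10):
    # Under QTI, score(q, d) = sum over t in q of score(t, d), so we just
    # walk the posting list of each query term and accumulate.
    scores = defaultdict(float)
    for t in query_terms:
        for doc, s in index.get(t, {}).items():
            scores[doc] += s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(retrieve(["neural", "ranking"], index))
```

No neural network needs to run at query time: the learned model only contributes the scores stored in the postings.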
6. Explicit term matching (the Duet principle)
We learn a BM25-like term-counting based matching function using gradient descent:

s_lex(t, d) = idf_t · tf_{t,d} / (tf_{t,d} + α + β · |d|)

where idf_t, tf_{t,d}, and |d| denote inverse document frequency, term frequency, and document length, and α and β are the only two learnable parameters of the model

We also define a new BatchScale operation as follows:

BatchScale(x) = x / (E[x] + ε), with the mean taken over the batch

The final term-document score is a linear combination of this and the CK score:

s(t, d) = w₁ · BatchNorm(s_CK(t, d)) + w₂ · BatchNorm(s_lex(t, d))

where the BatchNorm operation is defined as:

BatchNorm(x) = (x − E[x]) / √(Var[x] + ε)
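A sketch of such a learned term-counting function. The exact parameterization (how α and β enter, and where BatchScale and BatchNorm are applied) is an assumption here, shown only to illustrate the shape of the computation:

```python
import numpy as np

def lexical_score(idf, tf, doc_len, alpha, beta):
    # BM25-like saturation in tf with a document-length penalty;
    # alpha and beta stand in for the model's two learnable parameters.
    return idf * tf / (tf + alpha + beta * doc_len)

def batch_scale(x, eps=1e-6):
    # Assumed form: rescale by the batch mean so score magnitudes are
    # comparable across batches.
    x = np.asarray(x, dtype=float)
    return x / (x.mean() + eps)

def batch_norm(x, eps=1e-6):
    # Standard batch normalization: zero mean, unit variance per batch.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Saturation, as in BM25: doubling tf less than doubles the score.
s1 = lexical_score(idf=2.0, tf=1.0, doc_len=100, alpha=0.5, beta=0.01)
s2 = lexical_score(idf=2.0, tf=2.0, doc_len=100, alpha=0.5, beta=0.01)
print(s1, s2)
```

Because the function is differentiable in α and β, it can be trained end-to-end with the rest of the CK model by gradient descent.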
7. ORCAS
Before releasing the ORCAS click log data, we ran initial experiments on the TREC 2019 test set
We noticed small improvements by using ORCAS as (i) an additional document field and (ii) additional training data
At TREC 2020, we only explored using ORCAS as an additional description field for documents
8. Key research questions and results
RQ1. Does explicit term matching improve retrieval quality?
Under the rerank setting: No (but note that the candidates are pre-selected based on Indri term matching); under the fullrank setting: Yes (similar observation by Kuzi et al. [2020])
RQ2. Does the retrieval quality improve for our model under the fullrank compared to the rerank setting?
Without explicit term matching: No; with explicit term matching: Yes (on NCG@100 w/o ORCAS and on all metrics w/ ORCAS)
RQ3. Does using ORCAS queries as an additional document description field improve retrieval quality?
Under the rerank setting: Yes (on NDCG@10 and AP); under the fullrank setting: Yes (on all metrics)
10. How did we compare to other participating groups?
Our best run (ndrm3-orc-full)
Based on the best run per group, we were among the top 5 participating groups this year
For context, 10 groups submitted at least one nnlm run this year, and all our runs were nn runs
Source: Overview of TREC 2020 by Ellen Voorhees
11. How did we compare to other participating groups?
Our best run (ndrm3-orc-full) has the highest NDCG@10 of all nn runs, and is competitive with several nnlm runs
Every run with better NDCG@10 than ndrm3-orc-full uses: (i) large-scale pretraining, and (ii) multi-stage cascaded ranking
In contrast, ndrm3-orc-full is a single-stage retrieval model trained from scratch (incl. word embeddings) in under 1.5 days using 4x Tesla P100 GPUs (no pretraining)
We also observe that ndrm3-orc-full does well as a candidate generation technique, as measured by NCG@100
We note that 2 (out of the 3) trad runs that achieve better NCG@100 incorporate phrasal matching, which may be worth considering for future work in the context of DNN models with QTI
12. The code
Designed to be easy to use for anyone new to, or experienced with, the TREC Deep Learning track
The cost of pretraining limits architecture exploration on top of BERT; NDRM3 can serve as a good baseline for new architecture/technique development and for hypothesis testing
Can serve as a baseline for both the rerank and fullrank settings
13. Get started with Deep Learning for Search
Public resources for neural IR research:
Book (http://bit.ly/fntir-neural)
Tutorial (http://bit.ly/deeplearning4search-fire2019)
Data (http://www.msmarco.org and https://bit.ly/TREC-DL-2020)
Code (https://bit.ly/TREC-DL-Quick-Start)