This document summarizes the goals and setup of the 2019 TREC deep learning track for passage and document retrieval. It describes the datasets used, which include over 300,000 queries with human labels for training and new test sets. It discusses the types of models submitted, with "nnlm" models using BERT performing best. Metrics are analyzed showing these models outperform traditional "trad" baselines. The document also considers implications for real-world search systems.
2. Goal: Large, human-labeled, open IR data
Past: Proprietary data (200K queries, human-labeled, proprietary)
Past: Weak supervision (1+M queries, weak supervision, open)
Here: Two new datasets (300+K queries, human-labeled, open)
Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017.
Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017.
More data → better search results
TREC 2019 Deep Learning Track
3. Deriving our TREC 2019 datasets
MS MARCO QnA Leaderboard
• 1M real queries
• 10 passages per Q
• Human annotation says ~1 of 10 answers the query
MS MARCO Passage Retrieval Leaderboard
• Corpus: Union of the 10-passage sets
• Labels: From the ~1 positive passage
TREC 2019 Task: Passage Retrieval
• Same corpus, training Q+labels
• New reusable NIST test set
TREC 2019 Task: Document Retrieval
• Corpus: Documents (crawl passage urls)
• Labels: Transfer from passage to doc (sketch below)
• New reusable NIST test set
http://msmarco.org
https://microsoft.github.io/TREC-2019-Deep-Learning/
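The document-task labels above are transferred from the passage judgments via the passages' source URLs. A minimal sketch of that idea, assuming TREC-format passage qrels and a tab-separated passage-id to document-id mapping; the file layouts and function names are illustrative, not the track's official tooling:

```python
# Sketch: transfer passage-level relevance labels to the documents whose
# URLs contained those passages. File layouts here are assumptions.

def load_passage_to_doc(mapping_path):
    """Map passage id -> document id (via the crawled passage URL)."""
    passage_to_doc = {}
    with open(mapping_path) as f:
        for line in f:
            passage_id, doc_id = line.rstrip("\n").split("\t")
            passage_to_doc[passage_id] = doc_id
    return passage_to_doc

def transfer_qrels(passage_qrels_path, passage_to_doc):
    """Turn passage qrels lines (qid 0 pid rel) into document-level labels."""
    doc_qrels = {}
    with open(passage_qrels_path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            doc_id = passage_to_doc.get(pid)
            if doc_id is None:
                continue  # passage URL not present in the document crawl
            key = (qid, doc_id)
            doc_qrels[key] = max(doc_qrels.get(key, 0), int(rel))
    return doc_qrels
```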
4. Setup of the 2019 deep learning track
• Key question: What works best in a large-data regime?
• “nnlm”: Runs that use a BERT-style language model
• “nn”: Runs that do representation learning
• “trad”: Runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
• “fullrank”: End-to-end retrieval
• “rerank”: Top-k reranking. Document task: rerank the top k=100 from an Indri query-likelihood run. Passage task: rerank the top k=1000 from BM25. (Sketch after the table below.)
Task | Training data | Test data | Corpus
1) Document retrieval | 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents
2) Passage retrieval | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
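The “rerank” subtask only reorders the candidates supplied by the provided first-stage run (Indri QL top 100 for documents, BM25 top 1000 for passages). A minimal sketch of that contract, with a placeholder term-overlap scorer standing in for whatever “nnlm”, “nn”, or “trad” model a run actually uses:

```python
# Sketch of the rerank contract: re-score only the provided candidates,
# then return them in the new order. The scorer below is a placeholder.

def rerank(query, candidates, score_fn, depth):
    """candidates: list of (doc_id, text); returns (doc_id, score) pairs, best first."""
    scored = [(cand_id, score_fn(query, text)) for cand_id, text in candidates[:depth]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap_score(query, text):
    """Toy scorer: how many query terms appear in the text."""
    query_terms = set(query.lower().split())
    return sum(1 for term in text.lower().split() if term in query_terms)

# Passage task example: reorder the provided top-1000 BM25 candidates.
# reranked = rerank("what is a corporation", bm25_top1000, overlap_score, depth=1000)
```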
5. Dataset availability
• Corpus+train+dev data for both tasks available now from the DL Track site*
• NIST test sets available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
6. Let’s talk: Baselines, bias, overfitting, replicability
• Our 2019 test sets can be reused in future papers
• Judging is sufficiently complete. Diverse pools. HiCAL.
• Risk: People make decisions using the test set (i.e. overfit)
• Safer: Submit to TREC 2020, to really prove your point
• One-shot submission, before labels even exist
• Submit runs, or even better, submit docker
• A good TREC track has:
• Many types of model (no cherrypicked comparisons)
• With proper optimization (no straw men)
• And full reporting of results (no publication bias)
• On an unseen test set (no overfitting)
8. Popular in 2019: Transfer learning
Can be used in “nnlm” and “nn” runs:
• “nnlm” if the pretrained model is BERT or similar
• “nn” if the pretrained model is word2vec or similar (sketch below)
https://ruder.io/state-of-transfer-learning-in-nlp/
[Diagram: pretrained model + our large IR data → information retrieval system]
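A hedged sketch of the “nnlm” recipe suggested by the diagram: score query-passage pairs with a pretrained BERT-style cross-encoder via Hugging Face transformers. The checkpoint name is a placeholder, and strong runs would first fine-tune such a model on the MS MARCO training labels:

```python
# Sketch of cross-encoder scoring with a pretrained BERT-style model.
# The checkpoint is a placeholder; fine-tune on MS MARCO labels in practice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def score_pairs(query, passages):
    """Return one relevance score per (query, passage) pair."""
    batch = tokenizer([query] * len(passages), passages,
                      padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (num_passages, 1)
    return logits.squeeze(-1).tolist()
```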
9. Official NIST test sets
Document task label distribution:
rel | #doc | %
3 | 841 | 5%
2 | 1149 | 7%
1 | 4607 | 28%
0 | 9661 | 59%
Passage task label distribution:
rel | #pass | %
3 | 697 | 8%
2 | 1804 | 19%
1 | 1601 | 17%
0 | 5158 | 56%
For metrics that need binary labels, binarize at a per-task relevance cutoff (see the sketch below).
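A minimal sketch of that binarization step; the per-task cutoff is left as a parameter here rather than hardcoded (the track overview gives the exact cutoffs):

```python
# Sketch: map graded labels to binary relevance for metrics like MAP/MRR.

def binarize_qrels(qrels, min_relevant):
    """qrels: {(qid, item_id): graded_label} -> {(qid, item_id): 0 or 1}."""
    return {key: int(rel >= min_relevant) for key, rel in qrels.items()}

# Example: treat grades at or above 2 as relevant.
# binary_qrels = binarize_qrels(graded_qrels, min_relevant=2)
```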
10. Document retrieval task
• Main metric: NDCG@10 (sketch below)
• Focus on top of ranking
• Avoid binarizing labels
• “nnlm” runs tend to perform best
• Significant wins over “trad”
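A minimal sketch of NDCG@10 over graded labels, assuming a trec_eval-style linear gain and log2 position discount; ranked_rels are the run's labels in rank order and all_judged_rels are the labels of every judged item for the query (used to form the ideal ranking):

```python
# Sketch of NDCG@k computed directly on graded labels (no binarization).
import math

def ndcg_at_k(ranked_rels, all_judged_rels, k=10):
    """ranked_rels: labels of returned items in rank order; all_judged_rels: all judged labels."""
    def dcg(rels):
        # rank 0 gets discount log2(2) = 1, i.e. 1/log2(rank + 1) with 1-based ranks
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(all_judged_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```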
14. Passage retrieval task
• Main metric: NDCG@10
• Focus on top of ranking
• Avoid binarizing labels
• “nnlm” runs tend to perform best
• Even more significant wins over “trad”
24. Conclusion
• Two large datasets with 300+K training queries
• Reusable NIST test sets
• “nnlm” does well
• “fullrank” and “rerank” are not that different this year
• DL Track 2020: repeat the track, continuing work on large data and transfer learning
25. Real-world implications
• BM25 is 25 years old (introduced in TREC-3)
• Used in many TRECs
• Used in many products (“BM25F” too)
• Robust and easy to use
• “nnlm”
• Used in one TREC. Soundly beaten next year?
• Used in products, perhaps?
• Robust? Easy to use? Transferable?
• What is your go-to ranker today?
• What is your go-to ranker in 10 years?