This document summarizes the goals and setup of the 2019 TREC deep learning track for passage and document retrieval. It describes the datasets used, which include over 300,000 queries with human labels for training and new test sets. It discusses the types of models submitted, with "nnlm" models using BERT performing best. Metrics are analyzed showing these models outperform traditional "trad" baselines. The document also considers implications for real-world search systems.
2. Goal: Large, human-labeled, open IR data
Past: Proprietary data (200K queries, human-labeled, proprietary)
Past: Weak supervision (1+M queries, weak supervision, open)
Here: Two new datasets (300+K queries, human-labeled, open)
Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017.
Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017.
More data → better search results
TREC 2019 Deep Learning Track
3. Deriving our TREC 2019 datasets
MS MARCO QnA Leaderboard
• 1M real queries
• 10 passages per Q
• Human annotation says ~1 of 10 answers the query
MS MARCO Passage Retrieval Leaderboard
• Corpus: Union of the 10-passage sets
• Labels: From the ~1 positive passage
TREC 2019 Task: Passage Retrieval
• Same corpus, training Q+labels
• New reusable NIST test set
TREC 2019 Task: Document Retrieval
• Corpus: Documents (crawl passage urls)
• Labels: Transfer from passage to doc (sketch below)
• New reusable NIST test set
http://msmarco.org
https://microsoft.github.io/TREC-2019-Deep-Learning/
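The document-task labels above are transferred from the passage judgments via the passages' source URLs. A minimal sketch of that idea, assuming TREC-format passage qrels and a tab-separated passage-id to document-id mapping; the file layouts and function names are illustrative, not the track's official tooling:

```python
# Sketch: transfer passage-level relevance labels to the documents whose
# URLs contained those passages. File layouts here are assumptions.

def load_passage_to_doc(mapping_path):
    """Map passage id -> document id (via the crawled passage URL)."""
    passage_to_doc = {}
    with open(mapping_path) as f:
        for line in f:
            passage_id, doc_id = line.rstrip("\n").split("\t")
            passage_to_doc[passage_id] = doc_id
    return passage_to_doc

def transfer_qrels(passage_qrels_path, passage_to_doc):
    """Turn passage qrels lines (qid 0 pid rel) into document-level labels."""
    doc_qrels = {}
    with open(passage_qrels_path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            doc_id = passage_to_doc.get(pid)
            if doc_id is None:
                continue  # passage URL not present in the document crawl
            key = (qid, doc_id)
            doc_qrels[key] = max(doc_qrels.get(key, 0), int(rel))
    return doc_qrels
```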
4. Setup of the 2019 deep learning track
• Key question: What works best in a large-data regime?
• “nnlm”: Runs that use a BERT-style language model
• “nn”: Runs that do representation learning
• “trad”: Runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
• “fullrank”: End-to-end retrieval
• “rerank”: Top-k reranking. Document task: rerank the top k=100 from an Indri query-likelihood run. Passage task: rerank the top k=1000 from BM25. (Sketch after the table below.)
Task | Training data | Test data | Corpus
1) Document retrieval | 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents
2) Passage retrieval | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
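The “rerank” subtask only reorders the candidates supplied by the provided first-stage run (Indri QL top 100 for documents, BM25 top 1000 for passages). A minimal sketch of that contract, with a placeholder term-overlap scorer standing in for whatever “nnlm”, “nn”, or “trad” model a run actually uses:

```python
# Sketch of the rerank contract: re-score only the provided candidates,
# then return them in the new order. The scorer below is a placeholder.

def rerank(query, candidates, score_fn, depth):
    """candidates: list of (doc_id, text); returns (doc_id, score) pairs, best first."""
    scored = [(cand_id, score_fn(query, text)) for cand_id, text in candidates[:depth]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap_score(query, text):
    """Toy scorer: how many query terms appear in the text."""
    query_terms = set(query.lower().split())
    return sum(1 for term in text.lower().split() if term in query_terms)

# Passage task example: reorder the provided top-1000 BM25 candidates.
# reranked = rerank("what is a corporation", bm25_top1000, overlap_score, depth=1000)
```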
5. Dataset availability
• Corpus+train+dev data for both tasks available now from the DL Track site*
• NIST test sets available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
6. Let’s talk: Baselines, bias, overfitting, replicability
• Our 2019 test sets can be reused in future papers
• Judging is sufficiently complete. Diverse pools. HiCAL.
• Risk: People make decisions using the test set (i.e. overfit)
• Safer: Submit to TREC 2020, to really prove your point
• One-shot submission, before labels even exist
• Submit runs, or even better, submit docker
• A good TREC track has:
• Many types of model (no cherrypicked comparisons)
• With proper optimization (no straw men)
• And full reporting of results (no publication bias)
• On an unseen test set (no overfitting)
8. Popular in 2019: Transfer learning
Can be used in “nnlm” and “nn” runs:
• “nnlm” if the pretrained model is BERT or similar
• “nn” if the pretrained model is word2vec or similar (sketch below)
https://ruder.io/state-of-transfer-learning-in-nlp/
[Diagram: pretrained model + our large IR data → information retrieval system]
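A hedged sketch of the “nnlm” recipe suggested by the diagram: score query-passage pairs with a pretrained BERT-style cross-encoder via Hugging Face transformers. The checkpoint name is a placeholder, and strong runs would first fine-tune such a model on the MS MARCO training labels:

```python
# Sketch of cross-encoder scoring with a pretrained BERT-style model.
# The checkpoint is a placeholder; fine-tune on MS MARCO labels in practice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def score_pairs(query, passages):
    """Return one relevance score per (query, passage) pair."""
    batch = tokenizer([query] * len(passages), passages,
                      padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (num_passages, 1)
    return logits.squeeze(-1).tolist()
```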
9. Official NIST test sets
Document task label distribution:
rel | #doc | %
3 | 841 | 5%
2 | 1149 | 7%
1 | 4607 | 28%
0 | 9661 | 59%
Passage task label distribution:
rel | #pass | %
3 | 697 | 8%
2 | 1804 | 19%
1 | 1601 | 17%
0 | 5158 | 56%
For metrics that need binary labels, binarize at a per-task relevance cutoff (see the sketch below).
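A minimal sketch of that binarization step; the per-task cutoff is left as a parameter here rather than hardcoded (the track overview gives the exact cutoffs):

```python
# Sketch: map graded labels to binary relevance for metrics like MAP/MRR.

def binarize_qrels(qrels, min_relevant):
    """qrels: {(qid, item_id): graded_label} -> {(qid, item_id): 0 or 1}."""
    return {key: int(rel >= min_relevant) for key, rel in qrels.items()}

# Example: treat grades at or above 2 as relevant.
# binary_qrels = binarize_qrels(graded_qrels, min_relevant=2)
```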
10. Document retrieval task
• Main metric: NDCG@10 (sketch below)
• Focus on top of ranking
• Avoid binarizing labels
• “nnlm” runs tend to perform best
• Significant wins over “trad”
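A minimal sketch of NDCG@10 over graded labels, assuming a trec_eval-style linear gain and log2 position discount; ranked_rels are the run's labels in rank order and all_judged_rels are the labels of every judged item for the query (used to form the ideal ranking):

```python
# Sketch of NDCG@k computed directly on graded labels (no binarization).
import math

def ndcg_at_k(ranked_rels, all_judged_rels, k=10):
    """ranked_rels: labels of returned items in rank order; all_judged_rels: all judged labels."""
    def dcg(rels):
        # rank 0 gets discount log2(2) = 1, i.e. 1/log2(rank + 1) with 1-based ranks
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(all_judged_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```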
14. Passage retrieval task
• Main metric: NDCG@10
• Focus on top of ranking
• Avoid binarizing labels
• “nnlm” runs tend to perform best
• Even more significant wins over “trad”
24. Conclusion
• Two large datasets with 300+K training queries
• Reusable NIST test sets
• “nnlm” does well
• “fullrank” and “rerank” are not that different this year
• DL Track 2020: repeat the track, continuing work on large data and transfer learning
25. Real-world implications
• BM25 is 25 years old (introduced in TREC-3)
• Used in many TRECs
• Used in many products (“BM25F” too)
• Robust and easy to use
• “nnlm”
• Used in one TREC. Soundly beaten next year?
• Used in products, perhaps?
• Robust? Easy to use? Transferable?
• What is your go-to ranker today?
• What is your go-to ranker in 10 years?