Overview of the TREC 2019 Deep Learning Track

Overview talk presented at TREC 2019, describing our benchmarking efforts for neural and non-neural information retrieval models in a large data regime.


1. Nick Craswell (Microsoft), Bhaskar Mitra (Microsoft, UCL), Emine Yilmaz (UCL), Daniel Campos (Microsoft)
2. Goal: Large, human-labeled, open IR data
   • Past (proprietary data): 200K queries, human-labeled, proprietary (Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017)
   • Past (weak supervision): 1+M queries, weak supervision, open (Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017)
   • Here (TREC 2019 Deep Learning Track): Two new datasets, 300+K queries, human-labeled, open
   [Chart axes: more data, better search results]
3. Deriving our TREC 2019 datasets
   MS MARCO QnA Leaderboard
   • 1M real queries
   • 10 passages per Q
   • Human annotation says ~1 of 10 answers the query
   MS MARCO Passage Retrieval Leaderboard
   • Corpus: Union of 10-passage sets
   • Labels: From the ~1 positive passage
   TREC 2019 Task: Passage Retrieval
   • Same corpus, training Q+labels
   • New reusable NIST test set
   TREC 2019 Task: Document Retrieval
   • Corpus: Documents (crawl passage URLs)
   • Labels: Transfer from passage to doc
   • New reusable NIST test set
   http://msmarco.org
   https://microsoft.github.io/TREC-2019-Deep-Learning/
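To make the released labels concrete, here is a minimal sketch (Python) of loading a TREC-style qrels file such as the ones in the NIST test sets. The local filename is a placeholder; the four-column layout (query id, iteration, document id, graded relevance) is the standard TREC qrels format.

    from collections import defaultdict

    def load_qrels(path):
        """Return {query_id: {doc_id: graded relevance}} from a TREC qrels file."""
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                qid, _iteration, docid, rel = line.split()
                qrels[qid][docid] = int(rel)
        return qrels

    qrels = load_qrels("2019qrels-pass.txt")  # placeholder filename for the passage test qrels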
4. Setup of the 2019 deep learning track
   • Key question: What works best in a large-data regime?
   • “nnlm”: Runs that use a BERT-style language model
   • “nn”: Runs that do representation learning
   • “trad”: Runs using only traditional IR features, such as BM25 and RM3 (BM25 sketched below)
   • Subtasks:
     • “fullrank”: End-to-end retrieval
     • “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.

   Task                    Training data                 Test data                    Corpus
   1) Document retrieval   367K queries w/ doc labels    43* queries w/ doc labels    3.2M documents
   2) Passage retrieval    502K queries w/ pass labels   43* queries w/ pass labels   8.8M passages
   * Mostly-overlapping query sets (41 shared)
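For reference, a minimal sketch of the Okapi BM25 scoring that the “trad” category relies on. The parameter values (k1=0.9, b=0.4) and the plain-Python term counting are illustrative assumptions; actual runs use real IR toolkits such as Indri, not this code.

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=0.9, b=0.4):
        """BM25 score of one document for one query; k1 and b are illustrative defaults."""
        doc_len = len(doc_terms)
        score = 0.0
        for term in set(query_terms):
            tf = doc_terms.count(term)
            df = doc_freq.get(term, 0)
            if tf == 0 or df == 0:
                continue
            idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return score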
5. Dataset availability
   • Corpus+train+dev data for both tasks available now from the DL Track site*
   • NIST test sets available to participants now
   • [Broader availability in Feb 2020]
   * https://microsoft.github.io/TREC-2019-Deep-Learning/
6. Let’s talk: Baselines, bias, overfitting, replicability
   • Our 2019 test sets can be reused in future papers
   • Judging is sufficiently complete. Diverse pools. HiCAL.
   • Risk: People make decisions using the test set (i.e. overfit)
   • Safer: Submit to TREC 2020, to really prove your point
     • One-shot submission, before labels even exist
     • Submit runs, or even better, submit a docker image
   • A good TREC track has:
     • Many types of model (no cherry-picked comparisons)
     • With proper optimization (no straw men)
     • And full reporting of results (no publication bias)
     • On an unseen test set (no overfitting)
7. Participation
8. Popular in 2019: Transfer learning
   Can be used in “nnlm” and “nn” runs:
   • “nnlm” if the pretrained model is BERT or similar
   • “nn” if the pretrained model is word2vec or similar
   https://ruder.io/state-of-transfer-learning-in-nlp/
   [Diagram: our large IR data → information retrieval system]
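As an illustration of the “nnlm” style of transfer learning, here is a minimal sketch that treats a pretrained BERT checkpoint as a cross-encoder over query-passage pairs, using the Hugging Face transformers library. The checkpoint name and single-logit head are assumptions, and the head is untrained here, so the scores only become meaningful after fine-tuning on the track's training labels.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumed checkpoint; any BERT-style model fits the "nnlm" definition.
    name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

    def relevance_score(query, passage):
        # Encode the pair as [CLS] query [SEP] passage [SEP]; read the single logit as a relevance score.
        inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            return model(**inputs).logits.item()

    print(relevance_score("what is the deep learning track",
                          "The TREC 2019 Deep Learning Track studies ad hoc ranking in a large-data regime."))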
9. • Official NIST test set: Document task
     rel   #doc    %
     3      841    5%
     2     1149    7%
     1     4607   28%
     0     9661   59%
   • Official NIST test set: Passage task
     rel   #pass   %
     3      697    8%
     2     1804   19%
     1     1601   17%
     0     5158   56%
   For metrics that need binary labels, binarize at:
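Where a metric needs binary labels, the graded qrels are thresholded. A minimal sketch follows; the threshold value used here is an illustrative assumption, not necessarily the track's official cutoff (which is the level indicated on the original slide).

    def binarize(qrels, threshold=2):
        """Map graded labels to 0/1 for binary metrics; threshold=2 is an assumed example value."""
        return {
            qid: {docid: int(rel >= threshold) for docid, rel in docs.items()}
            for qid, docs in qrels.items()
        }

    binary_qrels = binarize(qrels)  # reuses the qrels dict loaded earlier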
10. Document retrieval task
    • Main metric: NDCG@10 (sketched below)
    • Focus on top of ranking
    • Avoid binarizing labels
    • “nnlm” runs tend to perform best
    • Significant wins over “trad”
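A minimal sketch of NDCG@10 over graded labels, illustrating why the main metric focuses on the top of the ranking without binarizing. It assumes the common 2^rel - 1 gain with log2 discounting; the official track numbers come from trec_eval, whose gain convention may differ, so treat this only as an illustration.

    import math

    def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
        """NDCG@k with gain 2^rel - 1 and log2(rank+1) discount."""
        gains = [2 ** qrels_for_query.get(d, 0) - 1 for d in ranked_doc_ids[:k]]
        dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
        ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
        idcg = sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0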
11. Document retrieval task
    • Main metric: NDCG@10
    • Focus on top of ranking
    • Avoid binarizing labels
    • “nnlm” runs tend to perform best
    • Significant wins over “trad”
12. Queries
13. Passage retrieval task
    • Main metric: NDCG@10
    • Focus on top of ranking
    • Avoid binarizing labels
    • “nnlm” runs tend to perform best
    • Even more significant wins over “trad”
14. Passage retrieval task
    • Main metric: NDCG@10
    • Focus on top of ranking
    • Avoid binarizing labels
    • “nnlm” runs tend to perform best
    • Even more significant wins over “trad”
15. Queries
16. Subtasks: “fullrank” vs “rerank”
    • Many IR systems are multi-stage, for example a first-stage retriever followed by reranking (see the sketch below)
    • “rerank” subtask: First stage is shared
      • Document task: Rerank top-100 Indri QL
      • Passage task: Rerank top-1000 BM25
      • Advantages: Easier to participate. Reduces variability.
    • “fullrank” subtask: End-to-end retrieval
      • Document task: Retrieve from 3.2M documents
      • Passage task: Retrieve from 8.8M passages
      • Advantages: Additional relevant results. Align stages. Invent a single-stage end-to-end approach.
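A minimal sketch of the multi-stage layout the two subtasks correspond to. Both `index` and `neural_scorer` are hypothetical placeholders: in the “rerank” subtask the first stage (top-100 Indri QL or top-1000 BM25) is supplied by the organizers, while in the “fullrank” subtask participants own both stages.

    def first_stage(query, index, k=1000):
        # Candidate generation, e.g. BM25 over 8.8M passages or Indri QL over 3.2M documents.
        # `index` is a hypothetical search index exposing a search(query, k) method.
        return index.search(query, k)  # -> list of (doc_id, score), best first

    def rerank(query, candidates, neural_scorer):
        # The reranker only sees the candidate list; the rest of the corpus is never scored.
        rescored = [(doc_id, neural_scorer(query, doc_id)) for doc_id, _ in candidates]
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)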
17. Metrics analysis: Passage ranking
18. Metrics analysis: Passage ranking
19. Metrics analysis: Document ranking
20. Metrics analysis: Document ranking
21. Conclusion
    • Two large datasets with 300+K training queries
    • Reusable NIST test sets
    • “nnlm” does well
    • “fullrank” and “rerank” are not that different this year
    • DL Track 2020: Repeat. Continue work on large data, transfer learning.
22. Real-world implications
    • BM25 is 25 years old [introduced in TREC-3]
      • Used in many TRECs
      • Used in many products (“BM25F” too)
      • Robust and easy to use
    • “nnlm”
      • Used in one TREC. Soundly beaten next year?
      • Used in products, perhaps?
      • Robust? Easy to use? Transferable?
    • What is your go-to ranker today?
    • What is your go-to ranker in 10 years?
