Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Bhaskar Mitra
Principal Applied Scientist, Microsoft
@UnderdogGeek bmitra@microsoft.com
What came first, the dataset or the algorithm?
https://www.kdnuggets.com/2016/05/datasets-over-algorithms.html
A brief timeline of deep learning for search
[Timeline figure: years 2016 to 2020]
% of neural IR papers @ SIGIR: 8%
“I’m certain that deep learning will come to dominate SIGIR over the next couple of years … just like speech, vision, and NLP before it.”
Christopher Manning (SIGIR 2016 Keynote)
Deep learning is “in the air”
% of neural IR papers @ SIGIR: 8%, 23%
Several new deep learning papers for document ranking
Trained on 200K English queries from Bing.com (proprietary dataset)
Trained on 95K Chinese queries from Sogou.com (public dataset)
Trained using BM25-based weak labels
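One way to read "BM25-based weak labels" is that an unsupervised ranker's scores stand in for human judgments when building training pairs. The sketch below is only an illustration under that assumption: the `bm25_scores` helper, the `k1` and `b` defaults, and the toy documents are all hypothetical and are not taken from the papers the slide refers to.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration).
docs = [
    ["neural", "ranking", "models"],
    ["bm25", "ranking", "baseline"],
    ["cooking", "pasta", "recipes"],
]
query = ["neural", "ranking"]
scores = bm25_scores(query, docs)

# Weak supervision: treat the BM25 ordering as if it were a relevance label.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
pseudo_positive, pseudo_negative = ranked[0], ranked[-1]
print(pseudo_positive, pseudo_negative)
```

A model trained on such pairs can only be as good as the heuristic that produced them, which is part of why comparisons under weak supervision are hard to interpret.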
% of neural IR papers @ SIGIR: 8%, 23%, 42%
But are we making real progress?
¯_(ツ)_/¯
We launched the MS MARCO passage ranking benchmark with 0.5M+ English training queries
Passage Ranking Leaderboard
The myth of “no neural IR model worked before BERT”: first-generation deep ranking models, e.g., Duet and KNRM, and their variants, outperform (to date) most traditional IR methods by a fair margin on the MS MARCO benchmark
And then…
Google’s BERT paper posted on arXiv
% of neural IR papers @ SIGIR: 8%, 23%, 42%, 58%
Less than 3 months after the BERT paper was uploaded to arXiv:
😱 First BERT-based re-ranking model achieves 0.359 MRR, compared to the previous SOTA of 0.281, on MS MARCO
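For readers less familiar with the metric, here is a minimal sketch of how mean reciprocal rank can be computed for a run. The cutoff of 10, the function names, and the toy judgments are assumptions made for illustration; only the two scores in the last step (0.359 and 0.281) come from the slide.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str], cutoff: int = 10) -> float:
    """Return 1/rank of the first relevant item within the cutoff, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run: Dict[str, List[str]], qrels: Dict[str, Set[str]],
                         cutoff: int = 10) -> float:
    """Average the reciprocal rank over every query that has judgments."""
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(run.get(qid, []), rel, cutoff) for qid, rel in qrels.items())
    return total / len(qrels)


# Toy data (made up for illustration, not from MS MARCO).
qrels = {"q1": {"p3"}, "q2": {"p7"}, "q3": {"p1"}}
run = {
    "q1": ["p3", "p9", "p2"],  # relevant at rank 1 -> 1.0
    "q2": ["p4", "p7", "p5"],  # relevant at rank 2 -> 0.5
    "q3": ["p8", "p6", "p0"],  # relevant passage not retrieved -> 0.0
}
mrr = mean_reciprocal_rank(run, qrels)
print(round(mrr, 3))  # 0.5

# Relative gain implied by the two scores quoted on the slide.
relative_gain = (0.359 - 0.281) / 0.281
print(f"{relative_gain:.1%}")  # 27.8%
```

Read this way, the jump from 0.281 to 0.359 is roughly a 28% relative improvement on this benchmark.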
TREC Deep Learning Track w/ document *and* passage ranking tasks
% of neural IR papers @ SIGIR: 8%, 23%, 42%, 58%, 79%
Document Ranking Leaderboard + TREC 2020 Deep Learning Track
Did neural IR really have a “weak baselines” problem?
I will argue NO: (i) pre-MS MARCO, most neural IR papers that benchmarked on Robust04 (or other older TREC collections) were NOT trained on large labeled datasets and represent a biased sample of neural IR papers, and (ii) even in those cases there is little evidence that these papers employed any weaker baselines than non-neural IR papers
But we had a BIGGER benchmarking problem
The lack of public IR benchmarks with large-scale training data led to:
Comparisons under a low-data regime
  e.g., older TREC collections with a few hundred queries
Comparisons on (semi-)synthetic benchmarks
  e.g., TREC CAR
Comparisons under weak-supervision training
Comparisons on a corpus in a language different from the one the models were designed for
Performance of deep models typically improves with more training data (image source: The Duet paper)
Non-standardized benchmarks also required reimplementation of baselines (especially neural baselines), which meant that many of them were under-tuned, in turn contributing to the “weak baselines” problem!
Nope.
A tale of two benchmarking approaches
TREC-style benchmarking
Annual one-shot submission; strongest protection against overfitting to the eval set
Pooling-based judgments after run submission seem fairest to dramatically new methods (Yilmaz et al., 2020)
However, test sets are too small (only tens of queries) to reliably detect small improvements
Having only one opportunity to submit runs per year is very limiting for ongoing research
Once the QRELs are public, further evaluation based on the public eval set is less trustworthy and more likely to suffer from overfitting
Image source: Yilmaz et al., 2020
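The pooling idea mentioned above can be sketched roughly as follows: the top results of every submitted run are merged per query, and only that pool is sent to assessors. The run names, document IDs, and pool depth here are hypothetical, and real TREC pooling involves more than this simple union.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, Dict[str, List[str]]], depth: int) -> Dict[str, Set[str]]:
    """Union of the top-`depth` results from every run, per query."""
    pool: Dict[str, Set[str]] = {}
    for ranking_by_query in runs.values():
        for qid, ranked_ids in ranking_by_query.items():
            pool.setdefault(qid, set()).update(ranked_ids[:depth])
    return pool


def unjudged_at_k(ranked_ids: List[str], judged: Set[str], k: int) -> int:
    """Count how many of the top-k results were never shown to an assessor."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id not in judged)


# Hypothetical runs submitted before judging (IDs are made up).
runs = {
    "bm25":   {"q1": ["d1", "d2", "d3", "d4"]},
    "neural": {"q1": ["d5", "d1", "d6", "d2"]},
}
pool = build_pool(runs, depth=2)
print(sorted(pool["q1"]))  # ['d1', 'd2', 'd5']

# A method that arrives after judging may surface documents outside the pool.
late_run = ["d7", "d5", "d8"]
print(unjudged_at_k(late_run, pool["q1"], k=3))  # 2
```

Because a run submitted before judging contributes its own top results to the pool, its documents get assessed. A run evaluated later against fixed judgments may have unjudged documents counted as non-relevant.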
MS MARCO leaderboard-based benchmarking
More flexible submission policy is convenient for model development but risks overfitting to the eval set
A limit of at most one submission per group per week provides some safeguard
Evaluation is based on sparse labels collected prior to submissions; this provides “instant gratification” for participants but may underestimate the performance of new methods
The large test set (thousands of queries) is likely more sensitive to smaller improvements
Originally motivated by a need to provide ongoing evaluation support (as opposed to TREC’s once-a-year cycle), but the leaderboard can encourage overly competitive leaderboard-chasing that muddles any scientific conclusions
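To make the "tens of queries" versus "thousands of queries" contrast concrete, the following back-of-the-envelope sketch estimates the smallest mean per-query difference that clears a two-sided 5% threshold under a normal approximation. The per-query standard deviation of 0.3 and the sizes 50 and 5000 are assumptions chosen for illustration, not measurements from either benchmark.

```python
import math
from statistics import NormalDist


def min_detectable_difference(n_queries: int, per_query_std: float, alpha: float = 0.05) -> float:
    """Smallest mean paired difference that clears a two-sided z-test at level alpha.

    Uses a normal approximation: threshold = z * std / sqrt(n).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * per_query_std / math.sqrt(n_queries)


ASSUMED_STD = 0.3  # hypothetical spread of per-query metric differences

small = min_detectable_difference(50, ASSUMED_STD)    # "tens of queries"
large = min_detectable_difference(5000, ASSUMED_STD)  # "thousands of queries"
print(round(small, 3))  # 0.083
print(round(large, 3))  # 0.008
```

Under these assumptions, a 100-fold larger test set shrinks the detectable difference by a factor of 10, which is the sense in which the larger set is more sensitive to small improvements.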
Ideally, we should…
Run TREC-style benchmarking but at a higher cadence (e.g., monthly)
with pooling-based judging for a fresh test set, that contains at least
hundreds (if not thousands) of queries, in every cycle.
…But obviously this requires a *LOT* more resources and work than
what we are investing today as a community
But that’s just the tip of the iceberg
Firstly, we have seen huge improvements from BERT on *ONE
dataset*; while we anecdotally know that these models outperform
traditional methods on proprietary industry benchmarks, we need
more large-scale benchmarks to distinguish between general
progress in IR vs. incremental metrics improvement specific to the MS
MARCO collection
Secondly, BERT hasn’t shown dramatic improvements in IR in general, but specifically on *English queries*; we are not only guilty of Anglocentrism in our current evaluations, but our current benchmarking approaches are also unlikely to even scale to hundreds (or thousands) of other languages, especially low-resource languages
This is not meant to be a pessimistic message of “Too hard; can’t solve” but rather a call to action for the community to come together and think more ambitiously about what IR benchmarking should look like, say, five years from now.
And while we are being ambitious, we would also ❤ to get more
participation from the NLP and ML community on the MS MARCO
ranking tasks and the TREC Deep Learning Track
MS MARCO: http://www.msmarco.org/
TREC Deep Learning Track: https://microsoft.github.io/TREC-2020-Deep-Learning/
TREC Deep Learning Quick Start: https://github.com/bmitra-msft/TREC-Deep-Learning-Quick-Start
Download: http://bit.ly/fntir-neural
THANK YOU @UnderdogGeek bmitra@microsoft.com

A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond

  • 1. Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond. Bhaskar Mitra, Principal Applied Scientist, Microsoft. @UnderdogGeek, bmitra@microsoft.com
  • 2. What came first, the dataset or the algorithm? https://www.kdnuggets.com/2016/05/datasets-over-algorithms.html
  • 3. A brief timeline of deep learning for search (2016 to 2020)
  • 4. [% of neural IR papers @ SIGIR: 8%] Deep learning is “in the air”. “I’m certain that deep learning will come to dominate SIGIR over the next couple of years … just like speech, vision, and NLP before it.” Christopher Manning (SIGIR 2016 Keynote)
  • 5. [% of neural IR papers @ SIGIR: 8%] Deep learning is “in the air”
  • 6. [% of neural IR papers @ SIGIR: 8%, 23%] Several new deep learning papers for document ranking
  • 7. [% of neural IR papers @ SIGIR: 8%, 23%] Trained on 200K English queries from Bing.com (proprietary dataset)
  • 8. [% of neural IR papers @ SIGIR: 8%, 23%] Trained on 95K Chinese queries from Sogou.com (public dataset)
  • 9. [% of neural IR papers @ SIGIR: 8%, 23%] Trained using BM25-based weak labels
  • 10. [% of neural IR papers @ SIGIR: 8%, 23%, 42%] But are we making real progress? ¯\_(ツ)_/¯
  • 11. [% of neural IR papers @ SIGIR: 8%, 23%, 42%] We launched the MS MARCO passage ranking benchmark with 0.5M+ English training queries. [Image: Passage Ranking Leaderboard] The myth of “no neural IR model worked before BERT”: first generation deep ranking models, e.g., Duet and KNRM, and their variants, outperform (till date) most traditional IR methods by a fair margin on the MS MARCO benchmark
  • 12. [% of neural IR papers @ SIGIR: 8%, 23%, 42%] Google’s BERT paper posted on arXiv. And then…
  • 13. [% of neural IR papers @ SIGIR: 8%, 23%, 42%, 58%] Less than 3 months after the BERT paper was uploaded to arXiv 😱 First BERT-based ReRanking model achieves 0.359 MRR compared to previous SOTA of 0.281 on MS MARCO
  • 14. [% of neural IR papers @ SIGIR: 8%, 23%, 42%, 58%] TREC Deep Learning Track w/ document *and* passage ranking tasks
  • 15. [% of neural IR papers @ SIGIR: 8%, 23%, 42%, 58%, 79%] Document Ranking Leaderboard + TREC 2020 Deep Learning Track
  • 16. Did neural IR really have a “weak baselines” problem? I will argue NO: (i) pre-MS MARCO, most neural IR papers that benchmarked on Robust04 (or other older TREC collections) were NOT trained on large labeled datasets and represent a biased sample of neural IR papers, and (ii) even in those cases there is little evidence that these papers employed any weaker baselines than non-neural IR papers
  • 17. But we had a BIGGER benchmarking problem. The lack of public IR benchmarks with large-scale training data led to: (i) comparisons under a low-data regime, e.g., older TREC collections with a few hundred queries; (ii) comparisons on (semi-)synthetic benchmarks, e.g., TREC CAR; (iii) comparisons under weak supervision training; and (iv) comparisons on corpora in a language different from the one the models were designed for. The performance of deep models typically improves with more training data (image source: The Duet paper). Non-standardized benchmarks also required reimplementation of baselines (especially neural baselines), which meant that many of them were under-tuned, in turn contributing to the “weak baselines” problem!
  • 18.
  • 19. Nope.
  • 20. A tale of two benchmarking approaches: TREC-style benchmarking. Annual one-shot submission gives the strongest protection against overfitting to the eval set. Pooling-based judgments after run submission seem most fair to dramatically new methods (Yilmaz et al., 2020). However, test sets are too small (only tens of queries) to reliably detect small improvements. Having only one opportunity to submit runs per year is very limiting for ongoing research. Once the QRELs are public, further evaluation based on the public eval set is less trustworthy and more likely to suffer from overfitting. (Image source: Yilmaz et al., 2020)
  • 21. A tale of two benchmarking approaches: MS MARCO leaderboard-based benchmarking. A more flexible submission policy is convenient for model development but risks overfitting to the eval set; the limit of at most one submission per group per week provides some safeguard. Evaluation is based on sparse labels collected prior to submissions, which provides “instant gratification” for participants but may underestimate the performance of new methods. The large test set (thousands of queries) is likely more sensitive to smaller improvements. The leaderboard was originally motivated by a need to provide ongoing evaluation support (as opposed to TREC’s once-a-year cycle), but it can encourage overly competitive leaderboard-chasing that muddles any scientific conclusions.
  • 22. Ideally, we should… Run TREC-style benchmarking but at a higher cadence (e.g., monthly) with pooling-based judging for a fresh test set, that contains at least hundreds (if not thousands) of queries, in every cycle. …But obviously this requires a *LOT* more resources and work than what we are investing today as a community
  • 23. But that’s just the tip of the iceberg. Firstly, we have seen huge improvements from BERT on *ONE dataset*; while we anecdotally know that these models outperform traditional methods on proprietary industry benchmarks, we need more large-scale benchmarks to distinguish between general progress in IR and incremental metric improvements specific to the MS MARCO collection. Secondly, BERT has shown dramatic improvements not in IR as a whole but specifically on *English queries*; we are not only guilty of Anglocentrism in our current evaluations, but our current benchmarking approaches are also unlikely to scale to hundreds (or thousands) of other languages, especially low-resource languages. This is not meant to be a pessimistic message of “Too hard; can’t solve” but rather a call to action for the community to come together and think more ambitiously about what IR benchmarking should look like, say, five years from now.
  • 24. And while we are being ambitious, we would also ❤ to get more participation from the NLP and ML community on the MS MARCO ranking tasks and the TREC Deep Learning Track. MS MARCO: http://www.msmarco.org/ | TREC Deep Learning Track: https://microsoft.github.io/TREC-2020-Deep-Learning/ | TREC Deep Learning Quick Start: https://github.com/bmitra-msft/TREC-Deep-Learning-Quick-Start
  • 25. Download: http://bit.ly/fntir-neural THANK YOU @UnderdogGeek bmitra@microsoft.com
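
The two evaluation styles contrasted in slides 20 and 21 can be illustrated with a minimal sketch. It is not taken from the deck or from any official evaluation script: the function names, the toy runs, and the document IDs are all hypothetical. The first function computes MRR over sparse labels, assuming the cutoff of 10 used by the MS MARCO passage leaderboard. The second builds a depth-k judgment pool, the TREC-style step that decides which documents assessors judge after runs are submitted.

```python
from typing import Dict, List, Set

Run = Dict[str, List[str]]      # query id -> ranked list of document ids
Qrels = Dict[str, Set[str]]     # query id -> set of relevant document ids


def mrr_at_k(run: Run, qrels: Qrels, k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document in the top k."""
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run) if run else 0.0


def depth_k_pool(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Union of the top `depth` documents from every submitted run, per query."""
    pool: Dict[str, Set[str]] = {}
    for run in runs:
        for qid, ranked in run.items():
            pool.setdefault(qid, set()).update(ranked[:depth])
    return pool


# Hypothetical toy data: two runs and one sparse label per query.
run_a: Run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
run_b: Run = {"q1": ["d5", "d3", "d1"], "q2": ["d7", "d6", "d9"]}
sparse_qrels: Qrels = {"q1": {"d1"}, "q2": {"d7"}}

score_a = mrr_at_k(run_a, sparse_qrels)       # (1/2 + 1/3) / 2
score_b = mrr_at_k(run_b, sparse_qrels)       # (1/3 + 1) / 2
pool = depth_k_pool([run_a, run_b], depth=2)  # documents sent for judging
```

In this toy setup, a document such as `d5` that a new method ranks first is never labeled under the sparse scheme, so it can only count as non-relevant, whereas it does enter the depth-2 pool and would be judged. This is the sense in which sparse pre-collected labels may underestimate new methods.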