Benchmarking search relevance in industry vs academia
Nick Craswell
Principal Group Science Manager
Microsoft WebXT
Benchmarking search relevance
• Search task: Retrieve documents in response to a query
• Benchmark data: Queries, Corpus, Judgments (a test collection)
• Application-specific benchmarks -> Lots of room for optimization + ML, e.g. incorporating temporal factors in a news search product
• Core IR benchmarks (flat Q, flat D) -> Not always making progress?*
• Core IR task is important
• Unsolved. Fundamental. Building block
• Need benchmarks to encourage progress
* Armstrong, Moffat, Webber, Zobel. Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM 2009
What does progress look like?
Chris Buckley, Mandar Mitra, Janet A. Walz, and Claire Cardie. "SMART high precision: TREC 7." NIST Special Publication 500-242 TREC-7 (1999)
[Figure: average precision (y-axis, 0 to 0.6) on each task: TREC-1, TREC-2, TREC-3, TREC-4, TREC-5, TREC-6 (Task TD), TREC-6 (Task D), TREC-7. Series: TREC-1 system (1992) and TREC-7 system (1998). Annotation: "Progress".]
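The chart's y-axis is average precision. As a reminder of what that number measures, here is a minimal sketch of the standard (uninterpolated) definition, averaged over queries. The function names and the dictionary layout for runs and judgments are illustrative assumptions, not the format used by the cited TREC experiments.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list of doc ids.

    `ranking` is assumed to contain no duplicate ids; `relevant` is the
    set of ids judged relevant for the query (hypothetical layout).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

A query with judgments but no retrieved documents scores zero here, which keeps the mean comparable across systems that answer different subsets of queries.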
Yang, Wei, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the “Neural Hype”: Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. SIGIR 2019.
Three comments on this:
A. Test data is reused too much
B. Baseline is unclear
C. Not enough training data
A. Avoiding test data reuse
• Using multiple querysets in industry
• Make many decisions using queryset 1, few on 2, none on 3
• Refresh querysets often
• Academia: 1) Multiple test collections, 2) Leaderboards can reduce iteration, 3) Most convincing is one-time submission (e.g. TREC)
• Thought experiment:
  Queryset 1: Find an improvement
  Queryset 2: Choose a release candidate
  Queryset 3: Post-release measurement
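One way to keep the three querysets disjoint is to assign each query by a stable hash of its text. This is a sketch only: the 60/30/10 shares, the function names, and the hashing scheme are assumptions for illustration, not a description of any production setup.

```python
import hashlib


def assign_queryset(query, shares=(0.6, 0.3, 0.1)):
    """Map a query string to queryset 1, 2 or 3 (shares are assumed values)."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for index, share in enumerate(shares, start=1):
        cumulative += share
        if u < cumulative:
            return index
    # Guard against floating-point rounding at the upper edge.
    return len(shares)


def split_querysets(queries, shares=(0.6, 0.3, 0.1)):
    """Partition queries into disjoint querysets keyed 1..len(shares)."""
    sets = {i: [] for i in range(1, len(shares) + 1)}
    for q in queries:
        sets[assign_queryset(q, shares)].append(q)
    return sets
```

Because the assignment depends only on the query text, the same query cannot drift from the decision-making set into the held-out set between experiments. Refreshing a queryset then means sampling new queries, not reshuffling old ones.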
B. Production baseline
• Evaluate production ranker changes, which we want to deploy
• Pro: Avoid the weak baseline problem
• Con: Repeated incremental improvements increase complexity
• Pro: Improvements can add up
• Academic options:
• Not sure!
• Winners at TREC/leaderboard may be lucky. Strongest baseline is also lucky
• I would trust a high-ish baseline with statistically significant (SS) gains, e.g. two runs from one group
Ben Carterette. 2015. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. In SIGIR ’15.
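To make "statistically significant gains" between two runs concrete, here is a sketch of an exact paired sign-flip (randomization) test on per-query scores. It is one common choice among several and is not claimed to be the test used in the talk or in the cited paper. The function name and the input layout (two equal-length lists in the same query order) are assumptions, and full enumeration is only practical for small query counts.

```python
from itertools import product


def paired_sign_flip_test(baseline, treatment):
    """Two-sided exact randomization test on per-query score differences.

    Returns the fraction of sign assignments whose absolute summed
    difference is at least as large as the observed one.
    """
    if len(baseline) != len(treatment):
        raise ValueError("runs must cover the same queries in the same order")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2**n sign assignments; feasible only for small n.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With more than a couple of dozen queries, the usual alternative is to sample sign assignments at random instead of enumerating them. Either way, a p-value from a collection that has already guided many decisions is subject to the sequential-testing concern raised in the Carterette citation.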
C. Get more data
• In industry: 200K queries, human-labeled, proprietary
  Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017
  [Figure: axes labeled "More data" and "Better search results"]
• Academic data release (MS MARCO and TREC DL): 300+K queries, human-labeled, open
DNN vs 1990s IR
Artist’s impression of total victory
[Figure: average precision (y-axis, 0 to 0.6) on each task: TREC-1, TREC-2, TREC-3, TREC-4, TREC-5, TREC-6 (Task TD), TREC-6 (Task D), TREC-7, and a "Blind Test". Series: TREC-1 SMART through TREC-7 SMART, and TREC-26+ DNN.]
Nick Craswell. Neural Models for Full Text Search: Could the improvements add up? WSDM 2017 Practice and Experience Talk
• We decided to release data: Labels, clicks, etc
• Public leaderboard and TREC track (and code)
• Part of a larger open effort “AI at Scale”
Our external ranking benchmarks
TREC Deep Learning Track
https://msmarco.org
[Figure labels: BM25, BERT, Leader]
Conclusion: Industry perspective on academia
• Reusing test collections a lot is not something we’d advise
• Are you sure you made no decisions based on robust04?
• What if you had another robust04? Would your conclusions stand up?
• Submit to TREC: this is the most reliable way of avoiding overfitting
• With large training data we can significantly beat 1990s methods on core IR tasks, e.g. with BERT-style DNN rankers
• Not sure how to handle baselines in academia
• Would trust an experiment where baseline is not too low and there’s a gain
