Update of my WSDM 2017 practice and experience talk (also on SlideShare) on lessons from industry about the use of offline metrics in information retrieval. Since a key need is more training and test sets, this talk describes our more recent data releases.
2. Benchmarking search relevance
• Search task: Retrieve documents in response to a query
• Benchmark data: Queries, Corpus, Judgments (a test collection; metric computation sketched below)
• Application-specific benchmarks -> Lots of room for optimization+ML, e.g. incorporating temporal factors in a news search product
• Core IR benchmarks (flat Q, flat D) -> Not always making progress?*
• Core IR task is important
• Unsolved. Fundamental. Building block
• Need benchmarks to encourage progress
* Armstrong, Moffat, Webber, Zobel. Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM 2009.
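To make the metric concrete: a minimal sketch (queries, documents, and judgments here are made up, not from any TREC run) of scoring a ranked run against a test collection's judgments with Average Precision, the measure plotted on the next slide.

```python
# A test collection is queries plus relevance judgments (qrels); an
# offline metric scores a ranked run against the qrels. Sketch of
# Average Precision per query, and Mean AP over the queryset.

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one query: mean of precision@k at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

# Hypothetical qrels and run for a single query
qrels = {"q1": {"d2", "d5"}}
run = {"q1": ["d5", "d1", "d2", "d7"]}

map_score = sum(average_precision(run[q], qrels[q]) for q in qrels) / len(qrels)
print(f"MAP = {map_score:.3f}")  # (1/1 + 2/3) / 2 = 0.833
```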
3. What does progress look like?
Chris Buckley, Mandar Mitra, Janet A. Walz, and Claire Cardie. "SMART high precision: TREC 7." NIST Special Publication 500-242 TREC-7 (1999)
[Chart: Average Precision (0 to 0.6) across the TREC-1 through TREC-7 ad-hoc tasks (including TREC-6 Task TD and Task D), comparing the TREC-1 system (1992) against the TREC-7 system (1998); the gap between the two lines marks the progress.]
4. Yang, Wei, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. SIGIR 2019.
Three comments on this:
A. Test data is reused too much
B. Baseline is unclear
C. Not enough training data
7. A. Avoiding test data reuse
• Using multiple querysets in industry
• Make many decisions using queryset 1, few on queryset 2, none on queryset 3
• Refresh querysets often
• Academia: 1) Multiple test collections, 2) Leaderboards can reduce iteration, 3) Most convincing is one-time submission (e.g. TREC)
• Thought experiment (sketched in code below):
• Queryset 1: Find an improvement
• Queryset 2: Choose a release candidate
• Queryset 3: Post-release measurement
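A runnable sketch of this thought experiment; the ranker names and the scoring function are stand-ins, not real systems.

```python
import random

def evaluate(ranker, queryset):
    """Stand-in for running a ranker over a queryset and scoring it (e.g. MAP)."""
    return random.uniform(0.2, 0.6)

queryset_1 = [f"q{i}" for i in range(1000)]        # many decisions made here
queryset_2 = [f"q{i}" for i in range(1000, 1200)]  # few decisions
queryset_3 = [f"q{i}" for i in range(1200, 1400)]  # none until post-release

candidates = ["bm25+prf", "dnn_v1", "dnn_v2"]

# 1. Iterate freely on queryset 1 while developing candidate rankers.
dev_scores = {r: evaluate(r, queryset_1) for r in candidates}

# 2. Take only the best few to queryset 2; choose one release candidate.
shortlist = sorted(candidates, key=dev_scores.get, reverse=True)[:2]
release_candidate = max(shortlist, key=lambda r: evaluate(r, queryset_2))

# 3. Queryset 3 is read exactly once, after the choice is frozen.
print(release_candidate, evaluate(release_candidate, queryset_3))
```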
8. B. Production baseline
• Evaluate changes to the production ranker, i.e. the changes we actually want to deploy
• Pro: Avoids the weak-baseline problem
• Con: Repeated incremental improvements increase complexity
• Pro: Improvements can add up
• Academic options:
• Not sure!
• Winners at TREC/leaderboards may be lucky. The strongest baseline is also lucky
• I would trust a high-ish baseline with statistically significant gains, e.g. two runs from one group (see the significance-test sketch below)
Ben Carterette. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. SIGIR 2015.
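A minimal sketch of the kind of check I'd trust: a paired t-test on per-query AP for a baseline run vs. a new run from the same group, using scipy. The per-query numbers below are made up for illustration, and per Carterette, sequential testing still demands caution about multiple comparisons.

```python
from scipy import stats

# Per-query Average Precision for a high-ish baseline and a new ranker,
# paired by query (illustrative values, not real runs).
baseline_ap = [0.31, 0.42, 0.18, 0.55, 0.27, 0.49, 0.33, 0.40]
new_ap      = [0.35, 0.44, 0.17, 0.61, 0.30, 0.52, 0.38, 0.41]

mean_gain = sum(n - b for n, b in zip(new_ap, baseline_ap)) / len(new_ap)
t_stat, p_value = stats.ttest_rel(new_ap, baseline_ap)

print(f"mean AP gain = {mean_gain:.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# Carterette's caution: test enough candidates sequentially and the best
# result looks "significant" by luck, so correct for the comparisons made.
```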
9. C. Get more data
• In industry: 200K queries, human-labeled, proprietary (Mitra, Diaz and Craswell. Learning to Match Using Local and Distributed Representations of Text for Web Search. WWW 2017)
• Academic data release: MS MARCO and TREC DL: 300+K queries, human-labeled, open (training-triple format sketched below)
• More data -> better search results
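A sketch of consuming training data at this scale, assuming the tab-separated (query, positive passage, negative passage) triples layout that MS MARCO's training files follow; the file name and the training-loop snippet are illustrative.

```python
import csv

def read_triples(path, limit=None):
    """Yield (query, positive_passage, negative_passage) training triples
    from a tab-separated file, one triple per line."""
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            if limit is not None and i >= limit:
                break
            query, positive, negative = row[0], row[1], row[2]
            yield query, positive, negative

# e.g., feed pairwise examples to a neural ranker's training loop:
# for q, pos, neg in read_triples("triples.train.tsv", limit=1_000_000):
#     loss = max(0.0, 1.0 - score(q, pos) + score(q, neg))  # hinge on the margin
```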
10. DNN vs 1990s IR
Artist’s impression of total victory
[Chart: Average Precision (0 to 0.6) across the TREC-1 through TREC-7 tasks plus a blind test, for the SMART systems from TREC-1 through TREC-7 and an imagined "TREC-26+ DNN" that clearly beats them all.]
Nick Craswell. Neural Models for Full Text Search: Could the Improvements Add Up? WSDM 2017 Practice and Experience Talk.
11. • We decided to release data: labels, clicks, etc.
• Public leaderboard and TREC track (and code)
• Part of a larger open effort “AI at Scale”
13. Conclusion: Industry perspective on academia
• We'd advise against heavy reuse of test collections
• Are you sure you made no decisions based on robust04?
• If you had another robust04, would your conclusions stand up?
• Submitting to TREC is the most reliable way to avoid overfitting
• With large training data we can significantly beat 1990s methods on core IR tasks, e.g. with BERT-style DNN rankers
• Not sure how to handle baselines in academia
• I would trust an experiment where the baseline is not too low and there's a statistically significant gain