Update of my WSDM 2017 practice and experience talk (also on SlideShare) on lessons from industry about the use of offline metrics in information retrieval. Since a key need is more training and test sets, this talk describes our more recent data releases.
2. Benchmarking search relevance
• Search task: Retrieve documents in response to a query
• Benchmark data: Queries, Corpus, Judgments (a test collection; metric computation sketched below)
• Application-specific benchmarks -> Lots of room for optimization+ML, e.g. incorporating temporal factors in a news search product
• Core IR benchmarks (flat Q, flat D) -> Not always making progress?*
• Core IR task is important
• Unsolved. Fundamental. Building block
• Need benchmarks to encourage progress
* Armstrong, Moffat, Webber, Zobel. Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM 2009.
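To make the metric concrete: a minimal sketch (queries, documents, and judgments here are made up, not from any TREC run) of scoring a ranked run against a test collection's judgments with Average Precision, the measure plotted on the next slide.

```python
# A test collection is queries plus relevance judgments (qrels); an
# offline metric scores a ranked run against the qrels. Sketch of
# Average Precision per query, and Mean AP over the queryset.

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one query: mean of precision@k at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

# Hypothetical qrels and run for a single query
qrels = {"q1": {"d2", "d5"}}
run = {"q1": ["d5", "d1", "d2", "d7"]}

map_score = sum(average_precision(run[q], qrels[q]) for q in qrels) / len(qrels)
print(f"MAP = {map_score:.3f}")  # (1/1 + 2/3) / 2 = 0.833
```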
3. What does progress look like?
Chris Buckley, Mandar Mitra, Janet A. Walz, and Claire Cardie. "SMART high precision: TREC 7." NIST Special Publication 500-242 TREC-7 (1999)
[Chart: Average Precision (0 to 0.6) across the TREC-1 through TREC-7 ad-hoc tasks (including TREC-6 Task TD and Task D), comparing the TREC-1 system (1992) against the TREC-7 system (1998); the gap between the two lines marks the progress.]
4. Yang, Wei, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. SIGIR 2019.
Three comments on this:
A. Test data is reused too much
B. Baseline is unclear
C. Not enough training data
7. A. Avoiding test data reuse
• Using multiple querysets in industry
• Make many decisions using queryset 1, few on queryset 2, none on queryset 3
• Refresh querysets often
• Academia: 1) Multiple test collections, 2) Leaderboards can reduce iteration, 3) Most convincing is one-time submission (e.g. TREC)
• Thought experiment (sketched in code below):
• Queryset 1: Find an improvement
• Queryset 2: Choose a release candidate
• Queryset 3: Post-release measurement
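A runnable sketch of this thought experiment; the ranker names and the scoring function are stand-ins, not real systems.

```python
import random

def evaluate(ranker, queryset):
    """Stand-in for running a ranker over a queryset and scoring it (e.g. MAP)."""
    return random.uniform(0.2, 0.6)

queryset_1 = [f"q{i}" for i in range(1000)]        # many decisions made here
queryset_2 = [f"q{i}" for i in range(1000, 1200)]  # few decisions
queryset_3 = [f"q{i}" for i in range(1200, 1400)]  # none until post-release

candidates = ["bm25+prf", "dnn_v1", "dnn_v2"]

# 1. Iterate freely on queryset 1 while developing candidate rankers.
dev_scores = {r: evaluate(r, queryset_1) for r in candidates}

# 2. Take only the best few to queryset 2; choose one release candidate.
shortlist = sorted(candidates, key=dev_scores.get, reverse=True)[:2]
release_candidate = max(shortlist, key=lambda r: evaluate(r, queryset_2))

# 3. Queryset 3 is read exactly once, after the choice is frozen.
print(release_candidate, evaluate(release_candidate, queryset_3))
```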
8. B. Production baseline
• Evaluate changes to the production ranker, i.e. the changes we actually want to deploy
• Pro: Avoids the weak-baseline problem
• Con: Repeated incremental improvements increase complexity
• Pro: Improvements can add up
• Academic options:
• Not sure!
• Winners at TREC/leaderboards may be lucky. The strongest baseline is also lucky
• I would trust a high-ish baseline with statistically significant gains, e.g. two runs from one group (see the significance-test sketch below)
Ben Carterette. The Best Published Result is Random: Sequential Testing and Its Effect on Reported Effectiveness. SIGIR 2015.
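A minimal sketch of the kind of check I'd trust: a paired t-test on per-query AP for a baseline run vs. a new run from the same group, using scipy. The per-query numbers below are made up for illustration, and per Carterette, sequential testing still demands caution about multiple comparisons.

```python
from scipy import stats

# Per-query Average Precision for a high-ish baseline and a new ranker,
# paired by query (illustrative values, not real runs).
baseline_ap = [0.31, 0.42, 0.18, 0.55, 0.27, 0.49, 0.33, 0.40]
new_ap      = [0.35, 0.44, 0.17, 0.61, 0.30, 0.52, 0.38, 0.41]

mean_gain = sum(n - b for n, b in zip(new_ap, baseline_ap)) / len(new_ap)
t_stat, p_value = stats.ttest_rel(new_ap, baseline_ap)

print(f"mean AP gain = {mean_gain:.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# Carterette's caution: test enough candidates sequentially and the best
# result looks "significant" by luck, so correct for the comparisons made.
```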
9. C. Get more data
• In industry: 200K queries, human-labeled, proprietary (Mitra, Diaz and Craswell. Learning to Match Using Local and Distributed Representations of Text for Web Search. WWW 2017)
• Academic data release: MS MARCO and TREC DL: 300+K queries, human-labeled, open (training-triple format sketched below)
• More data -> better search results
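A sketch of consuming training data at this scale, assuming the tab-separated (query, positive passage, negative passage) triples layout that MS MARCO's training files follow; the file name and the training-loop snippet are illustrative.

```python
import csv

def read_triples(path, limit=None):
    """Yield (query, positive_passage, negative_passage) training triples
    from a tab-separated file, one triple per line."""
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            if limit is not None and i >= limit:
                break
            query, positive, negative = row[0], row[1], row[2]
            yield query, positive, negative

# e.g., feed pairwise examples to a neural ranker's training loop:
# for q, pos, neg in read_triples("triples.train.tsv", limit=1_000_000):
#     loss = max(0.0, 1.0 - score(q, pos) + score(q, neg))  # hinge on the margin
```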
10. DNN vs 1990s IR
Artist’s impression of total victory
[Chart: Average Precision (0 to 0.6) across the TREC-1 through TREC-7 tasks plus a blind test, for the SMART systems from TREC-1 through TREC-7 and an imagined "TREC-26+ DNN" that clearly beats them all.]
Nick Craswell. Neural Models for Full Text Search: Could the Improvements Add Up? WSDM 2017 Practice and Experience Talk.
11. • We decided to release data: labels, clicks, etc.
• Public leaderboard and TREC track (and code)
• Part of a larger open effort “AI at Scale”
13. Conclusion: Industry perspective on academia
• We'd advise against heavy reuse of test collections
• Are you sure you made no decisions based on robust04?
• If you had another robust04, would your conclusions stand up?
• Submitting to TREC is the most reliable way to avoid overfitting
• With large training data we can significantly beat 1990s methods on core IR tasks, e.g. with BERT-style DNN rankers
• Not sure how to handle baselines in academia
• I would trust an experiment where the baseline is not too low and there's a statistically significant gain