Online Learning to Rank
1. Online Learning to Rank
by Edward W Huang (ewhuang3) and
Yerzhan Suleimenov (suleime1)
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
3. What is learning to rank?
• Many information retrieval problems are ranking problems
• Also known as machine-learned ranking
– Uses machine learning techniques to create ranking models
• Training data: queries and documents matched with relevance
judgements
– Model sorts objects by relevance, preference, or importance
– Finds optimal combination of features
4. Applications of learning to rank
• Ranking problems in information retrieval
– Document Retrieval
– Sentiment analysis
– Product rating
– Anti-spam measures
– Search engines
• Many more applications, not just in information retrieval!
– Machine translation
– Computational biology
5. Online vs. offline learning to rank
• Training set is produced by human assessors
(offline)
– Time consuming and expensive to produce
– Not always in line with actual user preferences
• Data from users interacting with the system (online)
– Users leave a trace of interaction data: query reformulations, mouse movements, clicks, etc.
– Clicks are especially valuable when interpreted as preferences
6. Big issue with online learning to rank
• Exploration-exploitation dilemma
– Have to obtain new feedback to improve the system (exploration), while also utilizing past models to optimize result quality (exploitation)
– Solutions are discussed later
8. Ranking model training framework
• Discriminative training attributes
– Input space
– Output space
– Hypothesis space
– Loss function
• The ranking model is trained to predict the ground-truth labels in the training set, with prediction error measured by the loss function
• Test phase: new query arrives, trained ranking model sorts
documents according to relevance to query
9. Algorithms for learning to rank problems
• Categorized into three groups by their framework (input
representation and loss function)
– Pointwise
– Pairwise
– Listwise
T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval,
3(3): 225–331, 2009.
10. Limitations of the pointwise approach
• Does not consider interdependency among documents
• Does not make use of the fact that some documents are
associated with the same query
• Most IR evaluation measures are query-level and position-based
11. Pairwise and listwise
• Potential solutions to the previously mentioned exploration-exploitation
dilemma
• Pairwise approach
– Input: pairs of documents with labels identifying which one is
preferred
– Learns classifier to predict these labels
• Listwise approach
– Input: entire document list associated with a certain query
– Directly optimizes evaluation measures (e.g., Normalized Discounted
Cumulative Gain)
Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. "Balancing Exploration and Exploitation in Listwise
and Pairwise Online Learning to Rank for Information Retrieval." Information Retrieval 16.1 (2012): 63-90.
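The listwise measure named above, NDCG, can be computed directly. A minimal sketch in Python, assuming graded relevance labels and the linear-gain DCG variant (an exponential-gain variant, 2^rel - 1, is also common):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: gains are discounted by log2 of the rank,
    # so relevant documents near the top contribute more.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal ordering (relevance sorted descending),
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([3, 2, 0])` is 1.0 (the documents are already in ideal order), while `ndcg([0, 2, 3])` is lower because the most relevant document is ranked last.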
12. Absolute and relative feedback approaches
• Use feedback to learn personalized rankings
• Absolute feedback: contextual bandit learning
• Relative feedback: gradient methods and inferred preferences between complete result rankings
• Relative feedback usually performs better
– More robust to noisy feedback
– Scales to larger document spaces
Chen, Yiwei, and Katja Hofmann. "Online Learning to Rank: Absolute vs. Relative." Proceedings of the
24th International Conference on World Wide Web - WWW '15 Companion (2015).
14. Improving learning performance
• Search engine clicks are useful, but might be biased
– Bias might come from attractive titles, snippets, or captions
• A method exists to detect and compensate for caption bias
– Enables reweighting of clicks by the likelihood that they are caption-biased
– Clicks on attractively captioned links are counted as weaker relevance signals
K. Hofmann, F. Behr, and F. Radlinski. On caption bias in interleaving experiments. In Proc. of CIKM,
2012.
15. Handling caption bias
• Allow weighting of clicks based on likelihood that
each click is caption biased
• Model click probability as function of position,
relevance, and caption bias
– Visual characteristics of individual documents
– Pairwise feature to focus on relationships with
neighboring documents
• Learn model weights from past user behavior
• Remove caption bias to obtain an evaluation that better reflects relevance
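One way to realize the reweighting idea above is a logistic model over caption features. This is a hypothetical sketch: the feature set, coefficients, and function name are illustrative, not taken from the cited paper:

```python
import math

def click_weight(caption_features, coefficients, intercept):
    # Logistic estimate of the probability that a click was driven by an
    # attractive caption rather than by relevance; the click's evidence
    # weight is scaled down by that probability.
    # (Illustrative model, not the cited paper's exact formulation.)
    z = intercept + sum(c * f for c, f in zip(coefficients, caption_features))
    p_caption_biased = 1.0 / (1.0 + math.exp(-z))
    return 1.0 - p_caption_biased
```

A click on a document with strong attractiveness features thus counts for less than a click on a plainly captioned one.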
16. Improving learning speed
• Search engine clicks can be interpreted using interleaved comparison methods
– These reliably infer preferences between pairs of rankers
• Dueling bandit gradient descent learns from these comparisons
– Requires a separate user interaction for each pairwise comparison between exploratory rankers
• Multileave gradient descent learns from comparisons of multiple rankers at
once
– Uses a single user interaction
– Fast
Schuth, Anne, Harrie Oosterhuis, Shimon Whiteson, and Maarten De Rijke. "Multileave Gradient Descent for Fast Online
Learning to Rank." Proceedings of the Ninth ACM International Conference on Web Search and Data Mining - WSDM '16
(2016).
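The dueling bandit gradient descent loop described above can be sketched as follows. This is a simplified illustration: the function names are assumptions, and the interleaved comparison with users is abstracted into a single boolean outcome:

```python
import math
import random

def propose_candidate(weights, delta, rng):
    # Exploration: sample a random unit direction u and propose a
    # perturbed ranker, candidate = w + delta * u.
    u = [rng.gauss(0.0, 1.0) for _ in weights]
    norm = math.sqrt(sum(x * x for x in u)) or 1.0
    u = [x / norm for x in u]
    candidate = [w + delta * ui for w, ui in zip(weights, u)]
    return candidate, u

def dbgd_update(weights, direction, candidate_won, alpha):
    # Exploitation: if the candidate wins the interleaved comparison,
    # take a small step of size alpha toward it; otherwise keep the
    # current ranker unchanged.
    if candidate_won:
        return [w + alpha * d for w, d in zip(weights, direction)]
    return list(weights)
```

Multileave gradient descent follows the same pattern but compares many candidate directions in one multileaved result list, so a single user interaction can inform the update.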
18. How to evaluate rankers?
• After training a ranker, we need to find out how effective it is
• Offline evaluation methods
– Dependent on explicit expert judgements
– Not feasible in practice
• Online evaluation methods
– Leverage online data that can reflect ranker quality
– Click-based ranker evaluation (discussed next)
• State of the art software: Lerot
– Evaluates different algorithms
– Can simulate user clicking behaviour with user models
Schuth, Anne, Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. "Lerot." Proceedings of the 2013 Workshop on Living Labs for
Information Retrieval Evaluation - LivingLab '13 (2013).
19. Click-based ranker evaluation
• Online evaluation strategy based on clickthrough data
• Independent of expert judgments, unlike conventional evaluation
methods
– Measure reflects interest of an actual user rather than interest of an expert
providing relevance judgement
20. Challenges of using clickthrough data
• Handling presentation bias
– Design user interface with three features
• Blind test: hidden random variables underlying the hypothesis test
• Click to preference: the user’s click should reflect their actual judgment
• Low usability impact: interactive, user-friendly interface
• Identifying the superior of two rankers
– Unified user interface that sends user query to both rankers
– Mix two ranking results (discussed next)
– Show combined ranking to the user and record interesting/relevant clicks
T. Joachims, Evaluating Retrieval Performance Using Clickthrough Data, in J. Franke and G.
Nakhaeizadeh and I. Renz, "Text Mining", Physica/Springer Verlag, pp 79-96, 2003.
21. Mixing two ranking results
• Also known as interleaving
• The key is to balance the contributions of both rankers in the top-n
results
• Algorithms vary in mixing strategy
– Balanced Interleaving
– Team-Draft Interleaving
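Team-Draft Interleaving can be sketched as a pick-by-pick draft, assuming both rankers rank the same document pool. A coin flip breaks ties over which team picks next, and each team picks its highest-ranked document not yet shown:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    # Build the interleaved list one pick at a time. The team with fewer
    # picks goes next; ties are broken by a coin flip. Assumes both
    # rankings cover the same document pool.
    interleaved, team_a, team_b = [], set(), set()
    pool_size = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < pool_size:
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5
        )
        source, team = (ranking_a, team_a) if a_turn else (ranking_b, team_b)
        for doc in source:
            if doc not in interleaved:
                interleaved.append(doc)
                team.add(doc)
                break
    return interleaved, team_a, team_b
```

Clicks on documents contributed by team A are then credited as preferences for ranker A, and vice versa.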
22. Leveraging click responses from mixed rankings
• Each click represents a user preference for the ranker that contributed the
clicked link
• Thus, how clicks are aggregated, via a test statistic, is critical
– Essential for reliable evaluation of rankers
• One basic approach assigns equal weight to all clicks
– Suboptimal, since not all clicks are equally significant
– Caption bias!
• More advanced test statistics are discussed next
23. Test statistics for evaluation
• Learn weights to maximize the mean score difference between the best and worst
rankers
• Optimize the statistical power of the z-test by maximizing the z-score (i.e.,
minimizing the p-value)
– Removes the assumption of equal variance of weights
• Learn to invert the Wilcoxon signed-rank test
– Produces a scoring function that optimizes the Wilcoxon test
• The max-mean-difference statistic performs worst; the inverse z-test performs best
Yisong Yue, Yue Gao, O. Chapelle, Ya Zhang, T. Joachims, Learning more powerful test statistics for click-
based retrieval evaluation, Proceedings of the Conference on Research and Development in Information
Retrieval (SIGIR), 2010.
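As a concrete example of the z-test mentioned above, per-query click-credit differences between two rankers can be turned into a z-score. A minimal sketch; the learned statistics in the cited work additionally reweight the clicks themselves:

```python
import math

def z_score(per_query_diffs):
    # per_query_diffs[i] is the click-credit difference for query i
    # (credit for ranker A minus credit for ranker B). A large positive
    # z-score indicates a statistically significant preference for A.
    # Assumes at least two queries and non-identical differences.
    n = len(per_query_diffs)
    mean = sum(per_query_diffs) / n
    var = sum((d - mean) ** 2 for d in per_query_diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

Maximizing this z-score over click weights (equivalently, minimizing the p-value) is what gives the inverse z-test its statistical power.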
24. How good are interleaving methods?
• Interleaving methods are compared against baseline:
conventional evaluation methods based on absolute metrics
• Conventional evaluation methods based on absolute metrics
– Absolute usage statistics are expected to monotonically
change with respect to ranker quality
• Interleaving methods
– More user clicks are expected for better ranker
25. Relative performance of interleaving methods
• Experimental results on two rankers whose relative quality is known by
construction
• Conventional evaluation methods based on absolute metrics
– Did not reliably identify high-quality rankers
– Absolute usage statistics did not monotonically change with respect to ranker quality
• Balanced Interleaving and Team-Draft Interleaving algorithms
– Reliably identified high-quality rankers
– Number of preferences for better ranker is significantly larger
F. Radlinski, M. Kurup, T. Joachims, How Does Clickthrough Data Reflect Retrieval Quality?, Proceedings of
the ACM Conference on Information and Knowledge Management (CIKM), 2008.
26. How reliable are interleaving methods, and why choose them?
• Interleaving results agree with conventional evaluation
methods
• Achieves statistically reliable preferences compared to absolute
metrics
• Economical: the statistical power of roughly 10 interleaved clicks is
approximately equal to that of one manually judged query
• Not sensitive to different click aggregation schemes
• Can complement or even replace standard evaluation methods
based on manual judgments or absolute metrics
O. Chapelle, T. Joachims, F. Radlinski, Yisong Yue, Large-Scale Validation and Analysis of Interleaved Search Evaluation, ACM
Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.
27. Future directions
• Extending current linear approaches with online learning to rank algorithms that can
effectively learn more complex models
• Designing and re-running experiments with more complex models of click behavior to
better understand various click biases
• Learning distinctive properties, such as click dwell time and use of the back button, to
filter raw clicks
• Understanding the range of domains in which interleaving methods are highly effective
• Improving gradient-descent-based rankers by covering all search directions to speed up
learning
28. References
1. Chen, Yiwei, and Katja Hofmann. "Online Learning to Rank: Absolute vs. Relative." Proceedings of the 24th
International Conference on World Wide Web - WWW '15 Companion (2015).
2. F. Radlinski, M. Kurup, T. Joachims, How Does Clickthrough Data Reflect Retrieval Quality?, Proceedings of
the ACM Conference on Information and Knowledge Management (CIKM), 2008.
3. Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. "Balancing Exploration and Exploitation in
Listwise and Pairwise Online Learning to Rank for Information Retrieval." Information Retrieval 16.1
(2012): 63-90.
4. T. Joachims, Evaluating Retrieval Performance Using Clickthrough Data, in J. Franke and G. Nakhaeizadeh
and I. Renz, "Text Mining", Physica/Springer Verlag, pp 79-96, 2003.
5. K. Hofmann, F. Behr, and F. Radlinski. On caption bias in interleaving experiments. In Proc. of CIKM, 2012.
6. O. Chapelle, T. Joachims, F. Radlinski, Yisong Yue, Large-Scale Validation and Analysis of Interleaved Search
Evaluation, ACM Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.
7. Schuth, Anne, Harrie Oosterhuis, Shimon Whiteson, and Maarten De Rijke. "Multileave Gradient Descent for
Fast Online Learning to Rank." Proceedings of the Ninth ACM International Conference on Web Search and
Data Mining - WSDM '16 (2016).
8. Schuth, Anne, Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. "Lerot." Proceedings of the 2013
Workshop on Living Labs for Information Retrieval Evaluation - LivingLab '13 (2013).
9. T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):
225–331, 2009.
10. Yisong Yue, Yue Gao, O. Chapelle, Ya Zhang, T. Joachims, Learning more powerful test statistics for click-
based retrieval evaluation, Proceedings of the Conference on Research and Development in Information
Retrieval (SIGIR), 2010.