
Neural Learning to Rank

Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as items to be recommended, in response to a user's need. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this tutorial is on the fundamentals of neural networks and their applications to learning to rank.


  1. Neural Learning to Rank. Bhaskar Mitra, Principal Applied Scientist, Microsoft; PhD candidate, University College London. @UnderdogGeek
  2. Topics: a quick recap of neural networks; the fundamentals of learning to rank.
  3. Reading material: An Introduction to Neural Information Retrieval, Foundations and Trends® in Information Retrieval (December 2018). Download PDF: http://bit.ly/fntir-neural
  4. Most information retrieval (IR) systems present a ranked list of retrieved artifacts.
  5. Learning to Rank (LTR): "... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." [Liu, 2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  6. A quick recap of neural networks
  7. Vectors, matrices, and tensors: matrix transpose, matrix addition, dot product, matrix multiplication. Image sources: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66 and https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
  8. Supervised learning. Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
  9. Neural networks: chains of parameterized linear transforms (e.g., multiply by weights, add bias) followed by non-linear functions (σ). Popular choices for σ: Tanh, ReLU. Parameters are trained using backpropagation, with end-to-end (E2E) training over millions of samples in batched mode. Many choices of architecture and hyper-parameters. (Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes a loss against the expected output, and the backward pass propagates gradients.)
  10. Basic machine learning tasks
  11. Squared loss: the squared loss l = (y − ŷ)² is a popular loss function for regression tasks.
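The squared loss above can be sketched in a couple of lines; function names here are illustrative, not from the slides.

```python
# Squared loss for a single regression example: penalize the squared
# difference between the target and the prediction.
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2
```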
  12. The softmax function: in neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes, p_i = e^{s_i} / Σ_j e^{s_j}.
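A minimal sketch of the softmax over a list of scores; subtracting the maximum score before exponentiating is a standard numerical-stability trick, an implementation detail beyond the slide.

```python
import math

def softmax(scores):
    # Shift by the max score so the exponentials cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```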
  13. Cross entropy: the cross entropy between two probability distributions p and q over a discrete set of events is given by H(p, q) = − Σ_i p_i log(q_i). If p_correct = 1 and p_i = 0 for all other values of i, then H(p, q) = − log(q_correct).
  14. Cross entropy with softmax loss: cross entropy with softmax is a popular loss function for classification, L = − log( e^{s_correct} / Σ_j e^{s_j} ).
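Combining the two previous slides, cross entropy with softmax for one example can be sketched as follows, assuming raw class scores and the index of the correct class (an assumed input format).

```python
import math

def softmax_cross_entropy(scores, correct_index):
    # log of the softmax normalizer, with the usual max-shift for stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    # -log softmax(scores)[correct_index]
    return log_z - scores[correct_index]
```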
  15. Gradient Descent. We are given training data: ⟨x, y⟩ pairs, where x is input and y is expected output. Step 1: Define the model and randomly initialize learnable model parameters. Step 2: Given x, compute model output. Step 3: Given model output and y, compute loss l. Step 4: Compute gradient ∂l/∂w of loss l w.r.t. each parameter w. Step 5: Update parameter as w_new = w_old − η × ∂l/∂w, where η is the learning rate. Step 6: Go back to step 2 and repeat till convergence.
  16. Gradient Descent (worked example, expanded step by step over slides 16–22). Task: regression. Training data: (x, y) pairs. Model: NN with 1 feature, 1 hidden layer, 1 hidden node; y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), loss l = (y − y2)². Learnable parameters: w1, b1, w2, b2. Goal: iteratively update the learnable parameters such that the loss l is minimized. Compute the gradient of the loss l w.r.t. each parameter (e.g., w1) by the chain rule: ∂l/∂w1 = ∂l/∂y2 × ∂y2/∂y1 × ∂y1/∂w1 = −2(y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x. Update the parameter value based on the gradient, with η as the learning rate: w1_new = w1_old − η × ∂l/∂w1 ... and repeat.
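The worked example on these slides can be sketched from scratch as below; the initial parameter values and learning rate in the usage are arbitrary choices, not from the slides.

```python
import math

def train_step(x, y, w1, b1, w2, b2, lr=0.1):
    # Forward pass through the 1-feature, 1-hidden-node network.
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    loss = (y - y2) ** 2
    # Backward pass: the chain rule expanded on the slides.
    dl_dy2 = -2 * (y - y2)
    dz2 = dl_dy2 * (1 - y2 ** 2)   # tanh'(w2*y1 + b2) = 1 - y2^2
    dw2, db2 = dz2 * y1, dz2
    dz1 = dz2 * w2 * (1 - y1 ** 2) # tanh'(w1*x + b1) = 1 - y1^2
    dw1, db1 = dz1 * x, dz1
    # Gradient-descent update with learning rate lr.
    return loss, (w1 - lr * dw1, b1 - lr * db1, w2 - lr * dw2, b2 - lr * db2)
```

Repeating the step shrinks the loss on a single (x, y) pair, e.g. starting from `params = (0.5, 0.0, 0.5, 0.0)` and calling `train_step(1.0, 0.8, *params)` in a loop.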
  23. Exercise: Simple Neural Network from Scratch. Implement a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer using (i) PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes. https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
  24. Computation Networks: the "Lego" approach to specifying neural architectures. A library of neural layers, where each layer defines logic for: 1. Forward pass: compute layer output given layer input. 2. Backward pass: (a) compute gradient of layer output w.r.t. layer inputs; (b) compute gradient of layer output w.r.t. layer parameters (if any). Chain nodes to create bigger and more complex networks.
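The "Lego" view above can be illustrated with two toy nodes; the class names and interface are assumptions for illustration, not an actual framework API.

```python
class Multiply:
    """Node computing x * w; w plays the role of a layer parameter."""
    def forward(self, x, w):
        self.x, self.w = x, w  # cache inputs for the backward pass
        return x * w
    def backward(self, grad_out):
        # (gradient w.r.t. input x, gradient w.r.t. parameter w)
        return grad_out * self.w, grad_out * self.x

class Add:
    """Node computing x + b; b plays the role of a bias parameter."""
    def forward(self, x, b):
        return x + b
    def backward(self, grad_out):
        # addition passes the gradient through unchanged to both inputs
        return grad_out, grad_out
```

Chaining `Multiply` then `Add` computes w·x + b, and calling `backward` in reverse order propagates gradients, which is exactly the chaining the slide describes.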
  25. Why adding depth helps: http://playground.tensorflow.org
  26. Bias-variance trade-off. https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b
  27. Bias-variance trade-off in the deep learning era. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. In PNAS, 2019.
  28. The lottery ticket hypothesis. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
  29. Questions?
  30. The fundamentals of learning to rank
  31. Problem formulation: LTR models represent a rankable item (e.g., a document, a movie, or a song), given some context (e.g., a user-issued query or the user's historical interactions with other items), as a numerical vector x ∈ ℝⁿ. The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher.
  32. Why is ranking challenging? Examples of ranking metrics: Discounted Cumulative Gain, DCG@k = Σ_{i=1..k} (2^{rel_i} − 1) / log₂(i + 1); Reciprocal Rank, RR@k = max_{1≤i≤k} rel_i / i. Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable.
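The two metrics on this slide can be sketched directly from their formulas; `rels` is assumed to be a list of graded relevance labels in ranked order.

```python
import math

def dcg_at_k(rels, k):
    # i is 0-based, so the rank is i + 1 and the discount is log2(rank + 1).
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(rels[:k]))

def rr_at_k(rels, k):
    # max over rel_i / i for ranks 1..k, per the slide's definition.
    return max((rel / (i + 1) for i, rel in enumerate(rels[:k])), default=0.0)
```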
  33. Features: traditional LTR models employ hand-crafted features that encode IR insights. They can often be categorized as: query-independent or static features (e.g., incoming link count and document length); query-dependent or dynamic features (e.g., BM25); query-level features (e.g., query length).
  34. Features. Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval. Information Retrieval Journal, 2010.
  35. Approaches. Liu [2009] categorizes different LTR approaches based on training objectives. Pointwise approach: the relevance label y_{q,d} is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR); typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}. Pairwise approach: pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) as label; reduces to binary classification to predict the more relevant document. Listwise approach: directly optimize for a rank-based metric, such as NDCG, which is difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
  36. Pointwise objectives: regression loss. Given ⟨q, d⟩, predict the value of y_{q,d}, e.g., squared loss for binary or categorical labels, where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  37. Pointwise objectives: classification loss. Given ⟨q, d⟩, predict the class y_{q,d}, e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008], where s_{y_{q,d}} is the model's score for label y_{q,d}. Ping Li, Qiang Wu, and Christopher J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
  38. Pairwise objectives: given ⟨q, d_i, d_j⟩, predict the more relevant document. For ⟨q, d_i⟩ and ⟨q, d_j⟩, the feature vectors are x_i and x_j, and the model scores are s_i = f(x_i) and s_j = f(x_j). Pairwise loss generally has the form φ(s_i − s_j) [Chen et al., 2009], where φ can be: the hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]; the exponential function φ(z) = e^{−z} [Freund et al., 2003]; the logistic function φ(z) = log(1 + e^{−z}) [Burges et al., 2005]; or others. Pairwise loss minimizes the average number of inversions in ranking, i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
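The three φ choices on this slide, applied to the score margin z = s_i − s_j of a preferred document d_i over d_j, can be sketched as:

```python
import math

def hinge(z):
    # phi(z) = max(0, 1 - z); zero once the margin exceeds 1.
    return max(0.0, 1.0 - z)

def exponential(z):
    # phi(z) = e^{-z}; decays but never reaches zero.
    return math.exp(-z)

def logistic(z):
    # phi(z) = log(1 + e^{-z}); the RankNet-style choice.
    return math.log(1.0 + math.exp(-z))
```

All three decrease as the margin z grows, so minimizing any of them pushes the preferred document's score above the other's.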
  39. Pairwise objectives: RankNet loss, a pairwise loss function proposed by Burges et al. [2005] and an industry favourite [Burges, 2015]. Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^{γ·s_i} / (e^{γ·s_i} + e^{γ·s_j}) = 1 / (1 + e^{−γ(s_i − s_j)}). Desired probabilities: p̄_ij = 1 and p̄_ji = 0. Computing the cross entropy between p̄ and p: L_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^{−γ(s_i − s_j)}). Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
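The final simplified form of the RankNet loss on this slide (desired p̄_ij = 1) is a one-liner:

```python
import math

def ranknet_loss(s_i, s_j, gamma=1.0):
    # log(1 + e^{-gamma * (s_i - s_j)}) for a pair where d_i is preferred.
    return math.log(1.0 + math.exp(-gamma * (s_i - s_j)))
```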
  40. A generalized cross-entropy loss: an alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D. Predicted probability: p(d⁺|q) = e^{γ·s(q,d⁺)} / Σ_{d∈D} e^{γ·s(q,d)}. The cross-entropy loss is then given by L_CE(q, d⁺, D) = −log p(d⁺|q) = −log( e^{γ·s(q,d⁺)} / Σ_{d∈D} e^{γ·s(q,d)} ). Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider a few negative candidates [Huang et al., 2013; Shen et al., 2014; Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
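A sketch of this loss with a handful of sampled negative scores standing in for the collection D; how the negatives are sampled is outside this snippet, and the max-shift is a stability detail not on the slide.

```python
import math

def ce_loss(pos_score, neg_scores, gamma=1.0):
    # -log softmax(gamma * scores)[positive], over positive + sampled negatives.
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = gamma * m + math.log(sum(math.exp(gamma * (s - m)) for s in scores))
    return log_z - gamma * pos_score
```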
  41. Listwise objectives [Burges, 2010]. Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. (Example figure: blue = relevant, gray = non-relevant; NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors.) But listwise metrics are non-continuous and non-differentiable. Christopher J. C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
  42. Listwise objectives: LambdaRank loss. Burges et al. [2006] make two observations: 1. To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. model scores). 2. It is desirable that the gradient be bigger for pairs of documents that produce a bigger impact on NDCG when swapping positions. LambdaRank multiplies the actual (RankNet) gradients by the change in NDCG from swapping the rank positions of the two documents. Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
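The swap-based weighting can be sketched as below, reusing the DCG form from the metrics slide; `rels` is assumed to be graded labels in current ranked order, and the normalization by an ideal DCG is passed in rather than computed here.

```python
import math

def dcg_gain(rel, rank):
    # Per-document DCG contribution: (2^rel - 1) / log2(rank + 1).
    return (2 ** rel - 1) / math.log2(rank + 1)

def delta_ndcg(rels, i, j, ideal_dcg):
    # |change in NDCG| if documents at 0-based positions i and j swap places.
    before = dcg_gain(rels[i], i + 1) + dcg_gain(rels[j], j + 1)
    after = dcg_gain(rels[j], i + 1) + dcg_gain(rels[i], j + 1)
    return abs(after - before) / ideal_dcg
```

In LambdaRank this |ΔNDCG| scales the RankNet gradient for the pair, so pairs whose swap matters more to NDCG receive larger updates.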
  43. Listwise objectives: ListNet and ListMLE. According to the Luce model [Luce, 2005], given four items {d1, d2, d3, d4}, the probability of observing a particular rank-order, say ⟨d2, d1, d4, d3⟩, is given by P(π) = Π_i [ φ(s_{π(i)}) / Σ_{j≥i} φ(s_{π(j)}) ], where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels; the loss is then given by the KL divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth; however, with categorical labels more than one permutation is possible. R. Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
  44. Listwise objectives: SmoothDCG. Wu et al. [2009] compute a "smooth" rank of documents as a function of their scores. This smooth rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
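One way to sketch a "smooth" rank in the spirit of this slide: approximate a document's rank by summing sigmoids of score differences, so the rank is differentiable in the scores. The sigmoid form and the temperature parameter are assumptions for illustration, not the exact construction from the paper.

```python
import math

def smooth_rank(scores, i, temperature=1.0):
    # Each document scored above scores[i] contributes ~1 to the rank;
    # the sigmoid makes the contribution a smooth function of the scores.
    return 1.0 + sum(
        1.0 / (1.0 + math.exp((scores[i] - s) / temperature))
        for j, s in enumerate(scores) if j != i
    )
```

Plugging such a smooth rank into DCG's discount in place of the hard rank yields a differentiable surrogate of the metric.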
  45. Questions? @UnderdogGeek bmitra@microsoft.com
