A Learning to Rank Project on a Daily Song Ranking Problem


Ranking data, i.e., ordered lists of items, naturally appear in a wide variety of situations; understanding how to adapt a specific dataset and design the best approach to solve a ranking problem in a real-world scenario is therefore crucial. This talk illustrates how to set up and build a Learning to Rank (LTR) project, starting from the available data (in our case a Spotify dataset on the Worldwide Daily Song Ranking, available on Kaggle) and ending with the implementation of a ranking model. A step-by-step (phased) approach to this task using open source libraries is presented. We examine in depth the most important part of the pipeline, the data preprocessing, and in particular how to model and manipulate the features in order to create the proper input dataset, tailored to the machine learning algorithm's requirements.



  1. London Information Retrieval Meetup. A Learning to Rank Project on a Daily Song Ranking Problem. Ilaria Petreti, Information Retrieval/ML Engineer, 3rd November 2020.
  2. Who I Am: Ilaria Petreti. Information Retrieval/Machine Learning Engineer; Master in Data Science; passionate about Data Mining and Machine Learning technologies; sports and healthy lifestyle lover.
  3. Search Services (www.sease.io): headquartered in London/distributed; open source enthusiasts; Apache Lucene/Solr/Elasticsearch experts; community contributors; active researchers. Hot trends: Learning to Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning.
  4. Clients.
  5. Overview: Problem Statement, Data Preprocessing, Model Training, Results.
  6. Problem Statement: how do we create a Learning to Rank pipeline using Spotify's Kaggle dataset? https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
  7. Learning to Rank. LTR is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, to the construction of ranking models for information retrieval systems. Training data consists of lists of items, and each item is composed of: a Query ID; a Relevance Rating; a Feature Vector of N features (<id>:<value>).
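The <id>:<value> feature-vector layout above is the one popularised by the LETOR/SVMrank datasets. As a hypothetical illustration (the helper name and the toy feature values are not from the talk), one training sample could be serialised like this:

```python
def to_letor_line(relevance, query_id, features):
    """Format one training sample in the common LETOR/SVMrank text layout:
    '<relevance> qid:<query_id> 1:<v1> 2:<v2> ...'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{relevance} qid:{query_id} {feats}"

# e.g. a song with relevance label 8 for query (region) 3 and two features
line = to_letor_line(8, 3, [0.5, 12])  # -> "8 qid:3 1:0.5 2:12"
```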
  8. Dataset Description. Spotify's Worldwide Daily Song Ranking: the 200 most listened-to songs in 53 countries; from 1st January 2017 to 9th January 2018; more than 3 million rows; 6,629 artists and 18,598 songs; a total count of about one hundred five billion streams.
  9. Learning to Rank, Our Approach (Spotify as the search engine): the QUERY is the Region; the DOCUMENT is the Song; the Relevance Rating is estimated from the Position on Chart; the Feature Vector is made up of all the other N features, used to train the ranking model.
  10. Overview: Data Preprocessing.
  11. Feature Levels. Each sample is a <query, document> pair, and the feature vector describes this pair numerically. Document-level: the feature describes a property of the DOCUMENT; its value depends only on the document instance (e.g. Document Type = Digital Music Service Product: Track Name, Artist, Streams). Query-level: the feature describes a property of the QUERY; its value depends only on the query instance (e.g. Query Type = Digital Music Service Search: Month, Day, Weekday). Query-dependent: the feature describes a property of the QUERY in correlation with the DOCUMENT; its value depends on both the query and the document instance (e.g. matching the query Region with the Title Language, matching the query Region with the Artist Nationality).
  12. Data Preprocessing: Data Cleaning (validity, accuracy, consistency, completeness, uniformity). Handling missing values: a total of 657 NaN in the Track Name and Artist features, filled using a DICTIONARY keyed by the song ID (URL): {0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje', 2: 'Otra Vez (feat. J Balvin)', 3: "Vente Pa' Ca", 4: 'Safari', 5: 'La Bicicleta', 6: 'Ay Mi Dios', 7: 'Andas En Mi Cabeza', 8: 'Traicionera', 9: 'Shaky Shaky', 10: 'Vacaciones', 11: 'Dile Que Tu Me Quieres', 12: 'Let Me Love You', 13: 'DUELE EL CORAZON', 14: 'Chillax', 15: 'Borro Cassette', 16: 'One Dance', 17: 'Closer', …}. A row whose Track Name is NaN is filled by looking its ID (URL) up in the dictionary.
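A minimal sketch of this dictionary-based fill, assuming a pandas DataFrame where the ID extracted from the song URL recurs across rows (the column names follow the slide; the toy data is illustrative):

```python
import pandas as pd

# The same ID appears on rows where the title is known and on rows where
# it is NaN, so the known rows can fill the missing ones.
df = pd.DataFrame({
    "ID": [0, 1, 2, 0, 3],
    "Track Name": ["Reggaetón Lento (Bailemos)", "Chantaje",
                   "Otra Vez (feat. J Balvin)", None, "Vente Pa' Ca"],
})

# Build the ID -> title dictionary from rows where the title is known...
title_by_id = (df.dropna(subset=["Track Name"])
                 .drop_duplicates("ID")
                 .set_index("ID")["Track Name"]
                 .to_dict())

# ...and use it to fill the missing values.
df["Track Name"] = df["Track Name"].fillna(df["ID"].map(title_by_id))
```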
  13. Feature Engineering: prepare the proper input dataset, compatible with the machine learning algorithm's requirements, and improve the performance of machine learning models. It comprises Feature Selection, Feature Extraction, Feature Transformation, Feature Importance and Categorical Encoding.
  14. Target: Relevance Rating. Position is the song's position on the chart (1 to 200). Position values have been grouped in two different ways: 1. Relevance Labels (Ranking) from 0 to 10; 2. Relevance Labels (Ranking) from 0 to 20. The 0-10 grouping: Position 1 → 10, 2 → 9, 3 → 8, 4-5 → 7, 6-10 → 6, 11-20 → 5, 21-35 → 4, 36-55 → 3, 56-80 → 2, 81-130 → 1, 131-200 → 0.
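The 0-10 grouping can be expressed as a small lookup function (a sketch; the band boundaries are taken from the slide's table):

```python
def position_to_label(position):
    """Map a chart position (1-200) to a relevance label (10 = best,
    0 = worst), following the 0-10 grouping on the slide."""
    bands = [(1, 10), (2, 9), (3, 8), (5, 7), (10, 6), (20, 5),
             (35, 4), (55, 3), (80, 2), (130, 1), (200, 0)]
    for upper, label in bands:
        if position <= upper:
            return label
    raise ValueError("position must be between 1 and 200")
```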
  15. Feature Engineering, Categorical Encoding. Track Name: the song title, a document-level feature (e.g. 'Reggaetón Lento (Bailemos)', 'Chantaje', 'Otra Vez (feat. J Balvin)', …, 'Let Her Go'). Two different approaches were used: Hash Encoding, which maps each category in a categorical feature to an integer within a pre-determined range, and doc2vec, a method to create a numeric representation of a document/sentence regardless of its length.
  16. Categorical Encoding: Hash Encoding. Feature Hashing, or 'the hashing trick', is a fast and space-efficient way of vectorising features. Use of the category_encoders library (imported as ce):

         title_encoder = ce.HashingEncoder(cols=['Track Name'], n_components=8)
         newds = title_encoder.fit_transform(ds2)

     Main arguments: cols, a list of columns to encode; n_components, how many bits to use to represent the feature (default is 8); hash_method, which hashing method to use (default is the 'md5' algorithm). https://contrib.scikit-learn.org/category_encoders/hashing.html
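For intuition, the md5-based hashing trick can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not category_encoders' exact implementation:

```python
import hashlib

def hash_encode(value, n_components=8):
    """Map a category string to one of n_components columns via the
    md5 hashing trick: hash the value, take it modulo the number of
    output columns, and set that column to 1."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components
    vector = [0] * n_components
    vector[index] = 1
    return vector
```

Collisions are the price of the fixed output size: two distinct titles can land in the same column, which is acceptable when the cardinality is high and the encoder must stay memory-bounded.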
  17. Categorical Encoding: Doc2Vec. Doc2Vec is an adaptation of Word2Vec that adds another feature vector named Paragraph ID. Using the gensim library: represent each sentence as a list of words (tokens); create a new TaggedDocument instance per sentence (tokens, tag); build the vocabulary; train the Doc2Vec model to obtain the trained document vectors. The main parameters are: documents, an iterable list of TaggedDocument elements; dm {1, 0}, which defines the training algorithm (by default dm = 1, the Distributed Memory version of Paragraph Vector, PV-DM); min_count, which ignores all words with total frequency lower than this; vector_size, the dimensionality of the feature vectors (100 by default). https://radimrehurek.com/gensim/models/doc2vec.html
  18. Feature Engineering: Language Detection from the Song Titles. Libraries considered: langdetect and guess_language-spirit, with low accuracy on short titles (they are built for large texts) but no usage limitation; TextBlob and Googletrans, with high accuracy but limited access (they rely on an external API). https://pypi.org/ https://textblob.readthedocs.io/en/dev/api_reference.html
  19. Categorical Encoding: Leave One Out Encoding. Artist: the name of the musician/singer or group, a document-level feature (e.g. CNCO, Shakira, Zion & Lennox, …, Passengers, encoded to 78.12742, 68.62432, 61.62190, …, 167.15266). Use of the category_encoders library: each category is replaced with the mean of the target over the other rows of that category, i.e. the current row's target is excluded when calculating the mean target for a level. https://contrib.scikit-learn.org/category_encoders/leaveoneout.html
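The leave-one-out computation can be sketched in plain Python (an illustration of the idea, not category_encoders' implementation; the global-mean fallback for single-occurrence categories is an assumption):

```python
def leave_one_out_encode(categories, targets):
    """Replace each category with the mean target of the *other* rows
    sharing that category, so a row never leaks its own target."""
    totals, counts = {}, {}
    for c, t in zip(categories, targets):
        totals[c] = totals.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((totals[c] - t) / (counts[c] - 1))
        else:
            # A category seen once has no "other" rows; fall back to
            # the global target mean (an assumption of this sketch).
            encoded.append(sum(targets) / len(targets))
    return encoded
```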
  20. Feature Engineering: Extracting Date (query-level features). Date is the chart date, e.g. 2017/01/01 to 2018/01/09, split into Year, Month, Day and Weekday columns: 2017/01/01 → (2017, 1, 1, 6); 2017/01/02 → (2017, 1, 2, 0); 2017/01/03 → (2017, 1, 3, 1); …; 2018/01/09 → (2018, 1, 9, 1).
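A sketch of this split using only the standard library (pandas' dt accessors would give the same values; note that weekday() uses Monday = 0 … Sunday = 6, which matches the values shown above):

```python
from datetime import datetime

def extract_date_features(date_str):
    """Split a chart date like '2017/01/01' into query-level features."""
    d = datetime.strptime(date_str, "%Y/%m/%d")
    return {"Year": d.year, "Month": d.month, "Day": d.day,
            "Weekday": d.weekday()}  # Monday=0 ... Sunday=6
```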
  21. Feature Engineering. Region: the country code, i.e. the query (e.g. ec, fi, cr, …, hn), mapped to query_ID values 0, 1, 2, …, 53 with pandas.factorize(), which obtains a numeric representation of an array when all that matters is identifying distinct values.
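A small example of pandas.factorize() on toy region codes:

```python
import pandas as pd

regions = ["ec", "fi", "cr", "ec", "fi"]

# codes assigns 0 to the first distinct value seen, 1 to the next, and so
# on; uniques records the distinct values in order of first appearance.
codes, uniques = pd.factorize(regions)
```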
  22. Feature Engineering: the final dataset.
  23. Overview: Model Training.
  24. Model Training: XGBoost. XGBoost is an optimised distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework; it is open source; it supports both pairwise and list-wise models. https://github.com/dmlc/xgboost
  25. Model Training: XGBoost. 1. Split the entire dataset into a Training Set, used to build and train the model (80%), and a Test Set, used to evaluate the model's performance on unseen data (20%). 2. Separate the Relevance Label, the query_ID and the training vectors into different components to create the XGBoost matrices. DMatrix is an internal data structure used by XGBoost, optimised for both memory efficiency and training speed.
  26. Training and Test Set Creation:

         training_data_set = training_set_data_frame[
             training_set_data_frame.columns.difference(
                 ['Ranking', 'ID', 'query_ID'])]
         training_query_id_column = training_set_data_frame['query_ID']
         training_query_groups = training_query_id_column.value_counts(sort=False)
         training_label_column = training_set_data_frame['Ranking']
         training_xgb_matrix = xgboost.DMatrix(training_data_set,
                                               label=training_label_column)
         training_xgb_matrix.set_group(training_query_groups)

         test_data_set = test_set_data_frame[
             test_set_data_frame.columns.difference(
                 ['Ranking', 'ID', 'query_ID'])]
         test_query_id_column = test_set_data_frame['query_ID']
         test_query_groups = test_query_id_column.value_counts(sort=False)
         test_label_column = test_set_data_frame['Ranking']
         test_xgb_matrix = xgboost.DMatrix(test_data_set, label=test_label_column)
         test_xgb_matrix.set_group(test_query_groups)
  27. Model Training: XGBoost. Train and test the model with the LambdaMART method: LambdaMART uses gradient boosted decision trees with a cost function derived from LambdaRank for solving a ranking task; the model performs list-wise ranking, where Normalised Discounted Cumulative Gain (NDCG) is maximised; list-wise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it; the evaluation measure is an average across the queries.
  28. Model Training: LambdaMART. Train and test the model:

         params = {'objective': 'rank:ndcg',
                   'eval_metric': 'ndcg@10',
                   'verbosity': 2,
                   'early_stopping_rounds': 10}
         watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]

         print('- - - - Training The Model')
         xgb_model = xgboost.train(params, training_xgb_matrix,
                                   num_boost_round=999, evals=watch_list)

         print('- - - - Saving XGBoost model')
         xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
         xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                              dump_format='json')
  29. Evaluation Metric: List-wise and NDCG. DCG@K (Discounted Cumulative Gain at K) measures the usefulness, or gain, of a document based on its relevance weight and its position in the result list. NDCG@K = DCG@K / Ideal DCG@K, so it always lies in the range [0, 1]. Example (relevance weights listed by result position): Model1 = [1, 2, 3, 4, 2, 0, 0], Model2 = [2, 3, 2, 4, 1, 0, 0], Model3 = [2, 4, 3, 2, 1, 0, 0], Ideal = [4, 3, 2, 2, 1, 0, 0]; their DCG values are 14.01, 15.76, 17.64 and 22.60, giving NDCG values of 0.62, 0.70, 0.78 and 1.0 respectively.
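The slide's numbers follow the exponential-gain formulation, DCG@K = sum over the top K positions of (2^rel - 1) / log2(position + 1); a short sketch:

```python
import math

def dcg_at_k(relevances, k):
    """Exponential-gain DCG over the top-k positions:
    sum of (2^rel - 1) / log2(position + 1), positions starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering,
    so the result always lies in [0, 1]."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Running this on the Model1 list reproduces the slide's values: DCG 14.01 and NDCG 0.62.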
  30. Common Mistakes. Mistakes to avoid during model creation: a single sample in a query group; one identical Relevance Label for all the samples in a query group. Such under-sampled query IDs can artificially inflate your average NDCG.
  31. Overview: Results.
  32. Results (NDCG@10, where '@10' denotes that the metric is evaluated only on the top 10 documents/songs):

         Hash Encoding, Relevance Labels (0-10): train-ndcg@10 = 0.7179, eval-ndcg@10 = 0.7351
         Hash Encoding, Relevance Labels (0-20): train-ndcg@10 = 0.8018, eval-ndcg@10 = 0.7740
         doc2vec Encoding, Relevance Labels (0-10): train-ndcg@10 = 0.8235, eval-ndcg@10 = 0.7633
         doc2vec Encoding, Relevance Labels (0-20): train-ndcg@10 = 0.8215, eval-ndcg@10 = 0.8244
  33. Conclusions: the importance of Data Preprocessing and Feature Engineering; Language Detection as an additional feature; doc2vec and a Relevance Rating in [0, 20] as the best approaches; online testing for LTR evaluation; use of the Tree SHAP library for feature importance (https://github.com/slundberg/shap).
  34. Thanks!
