London Information Retrieval Meetup
A Learning to Rank Project on a Daily Song Ranking Problem
Ilaria Petreti, Information Retrieval/ML Engineer
3rd November 2020
Who I Am
Ilaria Petreti
• Information Retrieval/Machine Learning Engineer
• Master in Data Science
• Passionate about Data Mining and Machine Learning technologies
• Sports and Healthy Lifestyle lover
● Headquarters in London/distributed
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevancy Tuning
www.sease.io
Search Services
Clients
Overview
Problem Statement
Data Preprocessing
Model Training
Results
Problem Statement
How can we create a Learning to Rank pipeline using Spotify’s Kaggle dataset?
https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
Learning to Rank
Learning to Rank (LTR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, to the construction of ranking models for information retrieval systems.
Training data consists of lists of items, and each item is composed of:
• Query ID
• Relevance Rating
• Feature Vector (N features, each written as <id>:<value>)
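As an illustration of the three components listed above, the sketch below serialises one training sample in the widely used `<rating> qid:<id> <feature id>:<value> ...` text layout. The function name `to_ltr_line` and the sample values are hypothetical, and the deck does not say that this exact file format was used in the project.

```python
def to_ltr_line(relevance, query_id, features):
    """Serialise one <query, document> sample as a single text line.

    relevance: integer relevance rating
    query_id:  integer identifying the query group
    features:  list of numeric feature values (feature ids start at 1)
    """
    parts = [str(relevance), f"qid:{query_id}"]
    parts += [f"{feature_id}:{value}" for feature_id, value in enumerate(features, start=1)]
    return " ".join(parts)


# Hypothetical sample: rating 3, query 7, three feature values.
line = to_ltr_line(3, 7, [0.5, 12.0, 1.0])
print(line)
```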
Dataset Description
Spotify’s Worldwide Daily Song Ranking:
• The 200 most listened-to songs in 53 countries
• From 1st January 2017 to 9th January 2018
• More than 3 million rows
• 6629 artists and 18598 songs
• A total of about 105 billion streams
Learning to Rank: Our Approach
Trained Ranking Model
QUERY is the Region
DOCUMENT is the Song
Relevance Rating = estimated from Position on Chart
Feature Vector = all the other N features
Spotify Search Engine
Data Preprocessing
Feature Level
Each sample is a <query, document> pair; the feature vector describes this pair numerically.
Document level
This feature describes a property of the DOCUMENT. The value of the feature depends only on the document instance.
e.g. Document Type = Digital Music Service Product
- Track Name
- Artist
- Streams
Query level
This feature describes a property of the QUERY. The value of the feature depends only on the query instance.
e.g. Query Type = Digital Music Service Search
- Month
- Day
- Weekday
Query Dependent
This feature describes a property of the QUERY in correlation with the DOCUMENT. The value of the feature depends on both the query and the document instance.
e.g. Query Type = Digital Music Service Search, Document Type = Digital Music Service Product
- Matching query Region-Title Language
- Matching query Region-Artist Nationality
Data Preprocessing: Data Cleaning
Data Cleaning: Validity, Accuracy, Consistency, Completeness, Uniformity
Handle Missing Values:
A total of 657 NaN values in the Track Name and Artist features were filled using a DICTIONARY keyed by song ID:
{0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje', 2: 'Otra Vez (feat. J Balvin)', 3: "Vente Pa' Ca", 4: 'Safari', 5: 'La Bicicleta', 6: 'Ay Mi Dios', 7: 'Andas En Mi Cabeza', 8: 'Traicionera', 9: 'Shaky Shaky', 10: 'Vacaciones', 11: 'Dile Que Tu Me Quieres', 12: 'Let Me Love You', 13: 'DUELE EL CORAZON', 14: 'Chillax', 15: 'Borro Cassette', 16: 'One Dance', 17: 'Closer', …}
ID (URL)  Track Name
0         Reggaetón Lento (Bailemos)
1         Chantaje
2         Otra Vez (feat. J Balvin)
0         NaN
3         Vente Pa' Ca
4         Safari
3         NaN
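A minimal sketch of the dictionary-based fill described above, assuming pandas. The column names mirror the table; the sample rows, and the choice to build the dictionary from rows that still carry a track name, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the table: two rows have lost their track name.
df = pd.DataFrame({
    "ID": [0, 1, 2, 0, 3, 4, 3],
    "Track Name": [
        "Reggaetón Lento (Bailemos)", "Chantaje", "Otra Vez (feat. J Balvin)",
        np.nan, "Vente Pa' Ca", "Safari", np.nan,
    ],
})

# Build the ID -> Track Name dictionary from the rows where the name is known.
known = df.dropna(subset=["Track Name"])
id_to_name = dict(zip(known["ID"], known["Track Name"]))

# Fill each missing name by looking up the row's ID in the dictionary.
df["Track Name"] = df["Track Name"].fillna(df["ID"].map(id_to_name))
print(df)
```

The same pattern would apply to the Artist column, with a second dictionary keyed by the same ID.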
Feature Engineering:
• Prepare the proper input dataset, compatible with the machine learning algorithm requirements
• Improve the performance of machine learning models
Feature Engineering: Feature Selection, Feature Extraction, Feature Transformation, Feature Importance, Categorical Encoding
Feature Engineering: Grouping
Target - Relevance Rating
Position: song's position on chart (1, 2, 3, …, 200)
Position values have been grouped in two different ways:
1. Relevance Labels (Ranking) from 0 to 10
2. Relevance Labels (Ranking) from 0 to 20
Grouping for the Relevance Labels from 0 to 10:
Position  Ranking
1         10
2         9
3         8
4-5       7
6-10      6
11-20     5
21-35     4
36-55     3
56-80     2
81-130    1
131-200   0
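The 0 to 10 grouping can be expressed as a lookup over the upper bound of each position bucket. This is a sketch only: the bucket boundaries come from the slide, while the function and constant names are hypothetical.

```python
from bisect import bisect_left

# Upper bound (inclusive) of each position bucket, from best to worst.
UPPER_BOUNDS = [1, 2, 3, 5, 10, 20, 35, 55, 80, 130, 200]


def position_to_relevance(position):
    """Map a chart position (1..200) to a relevance label (10..0)."""
    if not 1 <= position <= 200:
        raise ValueError("position must be between 1 and 200")
    # bisect_left returns the index of the first bucket whose upper bound
    # is >= position; bucket 0 gets label 10, bucket 10 gets label 0.
    return 10 - bisect_left(UPPER_BOUNDS, position)


print([position_to_relevance(p) for p in (1, 4, 10, 21, 130, 200)])
```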
Feature Engineering: Categorical Encoding
Document Level Feature
Track Name: song title (e.g. Reggaetón Lento (Bailemos), Chantaje, Otra Vez (feat. J Balvin), …, Let Her Go)
2 different approaches:
• Hash Encoding: feature hashing maps each category in a categorical feature to an integer within a pre-determined range
• doc2vec: a method to create a numeric representation of a document/sentence, regardless of its length
Categorical Encoding: Hash Encoding
Feature Hashing, or “The Hashing Trick”, is a fast and space-efficient way of vectorising features
• Use of the category_encoders library (as ce)
import category_encoders as ce
title_encoder = ce.HashingEncoder(cols=['Track Name'], n_components=8)
newds = title_encoder.fit_transform(ds2)
• Main Arguments:
  • cols: a list of columns to encode
  • n_components: how many bits to use to represent the feature (default is 8 bits)
  • hash_method: which hashing method to use (default is the “md5” algorithm)
https://contrib.scikit-learn.org/category_encoders/hashing.html
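To make the idea behind the hashing trick concrete, here is a standard-library sketch that maps a title to one of `n_components` buckets with MD5. It illustrates the principle only and is not the category_encoders implementation; the helper names `hash_bucket` and `hash_encode` are hypothetical.

```python
import hashlib


def hash_bucket(value, n_components=8):
    """Map a string to a bucket index in the range [0, n_components)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(value, n_components=8):
    """Return a fixed-length vector with a 1 in the bucket the value hashes to."""
    vector = [0] * n_components
    vector[hash_bucket(value, n_components)] = 1
    return vector


for title in ["Chantaje", "Safari", "Let Her Go"]:
    print(title, hash_encode(title))
```

The output width stays fixed however many distinct titles appear; the price is that different titles can collide in the same bucket.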
Categorical Encoding: Doc2Vec
• Adaptation of Word2Vec, adding another feature vector named Paragraph ID
• Use of the gensim library
• Represent each sentence as a list of words (tokens)
• Create a new instance of TaggedDocument (tokens, tag) for each sentence
• Build the Vocabulary
• Train the Doc2Vec model; the main parameters are:
  • documents: iterable list of TaggedDocument elements;
  • dm {1, 0}: defines the training algorithm; by default dm = 1, which is the Distributed Memory version of Paragraph Vector (PV-DM);
  • min_count: ignores all words with total frequency lower than this;
  • vector_size: dimensionality of the feature vectors (100 by default).
[Figure: TaggedDocument, Trained Document Vectors]
https://radimrehurek.com/gensim/models/doc2vec.html
Feature Engineering
Language Detection from the Song Titles
• langdetect, guess_language-spirit: low accuracy (built for large text), no limitation
• TextBlob, Googletrans: high accuracy, limited access (API)
https://pypi.org/
https://textblob.readthedocs.io/en/dev/api_reference.html
Feature Engineering: Categorical Encoding
Document Level Feature
Artist: name of musician/singer or group
Artist         Artists
CNCO           78.12742
Shakira        68.62432
Zion & Lennox  61.62190
…              …
Passengers     167.15266
Leave One Out Encoding
• Use of the category_encoders library
• It excludes the current row’s target when calculating the mean target for a level
[Figure: Leave One Out Encoding example with a FEATURE column (levels A, B, C) and a TARGET column; mean = 1.06]
https://contrib.scikit-learn.org/category_encoders/leaveoneout.html
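A plain-Python sketch of the leave-one-out idea: each row is encoded with the mean target of the other rows that share its level. The function name and the sample data are hypothetical, and falling back to the global mean for a level that occurs only once is an assumption of this sketch.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row with the mean target of the other rows of the same level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            # Exclude the current row's own target from the level mean.
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            # Assumption: a level seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded


encoded = leave_one_out_encode(["A", "B", "A", "B", "A"], [1.0, 2.0, 3.0, 4.0, 5.0])
print(encoded)
```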
Feature Engineering: Extracting Date
Query Level Feature
Date: chart date
Date        Year  Month  Day  Weekday
2017/01/01  2017  1      1    6
2017/01/02  2017  1      2    0
2017/01/03  2017  1      3    1
…           …     …      …    …
2018/01/09  2018  1      9    1
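The date split shown above can be reproduced with the standard library. In this sketch the function name is hypothetical, and the weekday convention (Monday = 0, Sunday = 6) is inferred from the table, where 2017/01/01, a Sunday, maps to 6.

```python
from datetime import datetime


def extract_date_features(date_string):
    """Split a 'YYYY/MM/DD' chart date into Year, Month, Day and Weekday."""
    parsed = datetime.strptime(date_string, "%Y/%m/%d")
    return {
        "Year": parsed.year,
        "Month": parsed.month,
        "Day": parsed.day,
        "Weekday": parsed.weekday(),  # Monday = 0 ... Sunday = 6
    }


print(extract_date_features("2017/01/01"))
print(extract_date_features("2018/01/09"))
```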
Feature Engineering
Query
Region: country code
pandas.factorize() to obtain a numeric representation of an array when all that matters is identifying distinct values
Region  query_ID
ec      0
fi      1
cr      2
…       …
hn      53
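A short sketch of how `pandas.factorize()` turns region codes into integer query IDs. The sample regions are hypothetical; codes are assigned in order of first appearance.

```python
import pandas as pd

# Hypothetical sample of the Region column.
regions = pd.Series(["ec", "fi", "cr", "ec", "hn", "fi"])

# codes: one integer per row; uniques: the distinct regions in order of first appearance.
codes, uniques = pd.factorize(regions)

print(list(codes))
print(list(uniques))
```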
Feature Engineering
Final Dataset
Model Training: XGBoost
XGBoost is an optimised distributed gradient boosting library designed to be highly efficient, flexible and portable.
https://github.com/dmlc/xgboost
• It implements machine learning algorithms under the Gradient Boosting framework.
• It is Open Source
• It supports both pairwise and list-wise models
Model Training: XGBoost
1. Split the entire dataset into:
   • Training Set, used to build and train the model (80%)
   • Test Set, used to evaluate the model performance on unseen data (20%)
2. Separate the Relevance Label, query_ID and training vectors as different components to create the xgboost matrices
DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed
Training and Test Set Creation
import xgboost

# Rows are assumed to be sorted by query_ID, so that the group sizes
# line up with the row order of the feature matrix.
# Feature columns: everything except the label and the identifiers.
training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
training_query_id_column = training_set_data_frame['query_ID']
# Number of samples in each query group.
training_query_groups = training_query_id_column.value_counts(sort=False)
training_label_column = training_set_data_frame['Ranking']
training_xgb_matrix = xgboost.DMatrix(training_data_set,
                                      label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)

# Same steps for the test set.
test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
test_query_id_column = test_set_data_frame['query_ID']
test_query_groups = test_query_id_column.value_counts(sort=False)
test_label_column = test_set_data_frame['Ranking']
test_xgb_matrix = xgboost.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
Model Training: XGBoost
Train and test the model with the LambdaMART method:
• The LambdaMART model uses gradient boosted decision trees with a cost function derived from LambdaRank to solve a ranking task.
• The model performs list-wise ranking where Normalised Discounted Cumulative Gain (NDCG) is maximised.
• List-wise approaches look directly at the entire list of documents and try to come up with the optimal ordering for it.
• The Evaluation Measure is an average across the queries.
Model Training: LambdaMART
Train and test the model with LambdaMART:
params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@10', 'verbosity': 2}
# Early stopping monitors the last entry of the watch list,
# so the evaluation set goes last.
watch_list = [(training_xgb_matrix, 'train'), (test_xgb_matrix, 'eval')]
print('- - - - Training The Model')
# early_stopping_rounds is an argument of xgboost.train(), not a booster parameter.
xgb_model = xgboost.train(params, training_xgb_matrix, num_boost_round=999,
                          evals=watch_list, early_stopping_rounds=10)
print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                     dump_format='json')
Evaluation Metric: List-wise and NDCG
Normalised Discounted Cumulative Gain
• DCG@K = Discounted Cumulative Gain@K
It measures the usefulness, or gain, of a document based on its position in the result list.
[Figure: DCG formula, annotated with "relevance weight" and "result position"]
• NDCG@K = DCG@K / Ideal DCG@K
• It will be in the range [0,1]
Relevance of the result at each position, per model:
Position  Model1  Model2  Model3  Ideal
1         1       2       2       4
2         2       3       4       3
3         3       2       3       2
4         4       4       2       2
5         2       1       1       1
6         0       0       0       0
7         0       0       0       0
DCG       14.01   15.76   17.64   22.60
NDCG      0.62    0.70    0.78    1.0
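The DCG and NDCG figures in the table can be recomputed with a few lines of Python. The gain formulation used here, (2^relevance - 1) / log2(position + 1), is an assumption, since the formula on the slide was an image; it does reproduce the table's values when rounded to two decimals. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the first k results."""
    return sum(
        (2 ** relevance - 1) / math.log2(index + 2)  # index 0 is position 1
        for index, relevance in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance per position for each model, taken from the table.
model1 = [1, 2, 3, 4, 2, 0, 0]
model2 = [2, 3, 2, 4, 1, 0, 0]
model3 = [2, 4, 3, 2, 1, 0, 0]

for name, relevances in [("Model1", model1), ("Model2", model2), ("Model3", model3)]:
    print(name, round(dcg_at_k(relevances, 7), 2), round(ndcg_at_k(relevances, 7), 2))
```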
Common Mistakes
Let’s see the common mistakes to avoid during model creation:
• One sample per query group
• One Relevance Label for all the samples in a query group
Under-sampled query IDs can make your average NDCG skyrocket, because such groups are trivially ranked perfectly
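A quick sanity check for both mistakes is to list the query groups that contain a single sample or a single distinct relevance label before training. This is a sketch only; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def degenerate_query_groups(query_ids, labels):
    """Return the query IDs with only one sample or only one distinct label."""
    labels_by_query = defaultdict(list)
    for query_id, label in zip(query_ids, labels):
        labels_by_query[query_id].append(label)
    return sorted(
        query_id
        for query_id, group_labels in labels_by_query.items()
        if len(group_labels) < 2 or len(set(group_labels)) < 2
    )


# Query 1 has a single sample; query 2 has the same label on every sample.
suspicious = degenerate_query_groups([0, 0, 1, 2, 2], [3, 1, 5, 2, 2])
print(suspicious)
```

Groups flagged this way can then be inspected, dropped or re-sampled before computing the average NDCG.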
Results
Relevance Labels  train-ndcg@10  eval-ndcg@10
(0-10)            0.7179         0.7351
(0-20)            0.8018         0.7740
(0-10)            0.8235         0.7633
(0-20)            0.8215         0.8244
Title encodings compared: doc2vec Encoding, Hash Encoding
NDCG@10, where ‘@10’ denotes that the metric is evaluated only on the top 10 documents/songs
Conclusions
• Importance of Data Preprocessing and Feature Engineering
• Language Detection as an additional feature
• doc2vec and Relevance Rating [0, 20] as best approaches
• Online testing in LTR evaluation
• Use of the Tree SHAP library for feature importance
https://github.com/slundberg/shap
Thanks!

More Related Content

What's hot

Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Learning to rank
Learning to rankLearning to rank
Learning to rankBruce Kuo
 
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Databricks
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Ferran Galí Reniu
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Using S3 Select to Deliver 100X Performance Improvements Versus the Public CloudUsing S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Using S3 Select to Deliver 100X Performance Improvements Versus the Public CloudDatabricks
 
Architecting Modern Data Platforms
Architecting Modern Data PlatformsArchitecting Modern Data Platforms
Architecting Modern Data PlatformsAnkit Rathi
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...lucenerevolution
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureDatabricks
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMongoDB
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 

What's hot (20)

Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Learning to rank
Learning to rankLearning to rank
Learning to rank
 
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Using S3 Select to Deliver 100X Performance Improvements Versus the Public CloudUsing S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
 
Architecting Modern Data Platforms
Architecting Modern Data PlatformsArchitecting Modern Data Platforms
Architecting Modern Data Platforms
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in Azure
 
Migrating to MongoDB: Best Practices
Migrating to MongoDB: Best PracticesMigrating to MongoDB: Best Practices
Migrating to MongoDB: Best Practices
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 

Similar to A Learning to Rank Project on a Daily Song Ranking Problem

Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsSease
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearchMinsoo Jun
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Databricks
 
Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Daniele Bailo
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Bradley Allen
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Péter Király
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadatasuyu22
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin &  Leanne La...OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin &  Leanne La...
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...NETWAYS
 
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...Paul Lo
 
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)Bradley Allen
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectEnrico Daga
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSAPRBETTER
 

Similar to A Learning to Rank Project on a Daily Song Ranking Problem (20)

Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
 
Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph Embeddings
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2Metadata & brokering - a modern approach #2
Metadata & brokering - a modern approach #2
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadata
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin &  Leanne La...OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin &  Leanne La...
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...
 
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
 
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)
Searching BBC Rushes Using Semantic Web Techniques (TRECVID 2005)
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything Project
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 





A Learning to Rank Project on a Daily Song Ranking Problem

  • 1. London Information Retrieval Meetup A Learning to Rank Project on a Daily Song Ranking Problem Ilaria Petreti, Information Retrieval/ML Engineer 3rd November 2020
  • 2. Who I Am: Ilaria Petreti
      - Information Retrieval/Machine Learning Engineer
      - Master in Data Science
      - Passionate about Data Mining and Machine Learning technologies
      - Sports and Healthy Lifestyle lover
  • 3. Search Services (www.sease.io)
      - Headquarters in London, distributed team
      - Open Source Enthusiasts
      - Apache Lucene/Solr/Elasticsearch experts
      - Community Contributors
      - Active Researchers
      - Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  • 5. Overview: Problem Statement, Data Preprocessing, Model Training, Results
  • 6. Problem Statement: How do we create a Learning to Rank pipeline using Spotify's Kaggle dataset? https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
  • 7. Learning to Rank: LTR is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items, and each item is composed of:
      - Query ID
      - Relevance Rating
      - Feature Vector (composed of N features, each written as <id>:<value>)
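The three components of a training item can be made concrete with a small sketch. It assumes the LibSVM-style text layout that many ranking tools read (`<label> qid:<query_id> <id>:<value> ...`); the slides only state the `<id>:<value>` feature notation, so the helper name and the sample values below are hypothetical.

```python
# Illustrative helper (not from the slides): serialise one training sample
# as "<relevance> qid:<query_id> <id>:<value> ...".
def to_ltr_line(relevance, query_id, features):
    # Feature ids are 1-based and follow the order of the feature list.
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{relevance} qid:{query_id} {pairs}"


# Hypothetical sample: relevance 7, query (region) 3, three feature values.
line = to_ltr_line(7, 3, [0.5, 12, 1.0])
print(line)
```

All samples that share a `qid` form one ranked list, which is what a list-wise or pairwise ranker compares.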
  • 8. Dataset Description: Spotify's Worldwide Daily Song Ranking
      - The 200 most listened songs in 53 countries
      - From 1st January 2017 to 9th January 2018
      - More than 3 million rows
      - 6629 artists and 18598 songs
      - A total of 105 billion streams
  • 9. Learning to Rank: Our Approach
      - QUERY is the Region
      - DOCUMENT is the Song
      - Relevance Rating = estimated from Position on Chart
      - Feature Vector = all the other N features
      - [Diagram labels: Spotify Search Engine, Trained Ranking Model]
  • 10. Agenda: Problem Statement, Data Preprocessing, Model Training, Results
  • 11. Feature Level: each sample is a <query, document> pair, and the feature vector describes this pair numerically.
      - Document level: the feature describes a property of the DOCUMENT. Its value depends only on the document instance. e.g. Document Type = Digital Music Service Product
          - Track Name
          - Artist
          - Streams
      - Query level: the feature describes a property of the QUERY. Its value depends only on the query instance. e.g. Query Type = Digital Music Service Search
          - Month
          - Day
          - Weekday
      - Query Dependent: the feature describes a property of the QUERY in correlation with the DOCUMENT. Its value depends on both the query and the document instance. e.g. Query Type = Digital Music Service Search, Document Type = Digital Music Service Product
          - Matching query Region-Title Language
          - Matching query Region-Artist Nationality
  • 12. Data Preprocessing: Data Cleaning
      - Data Cleaning dimensions: Validity, Accuracy, Consistency, Completeness, Uniformity
      - Handle Missing Values: a total of 657 NaN values in the Track Name and Artist features, filled using a DICTIONARY: {0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje', 2: 'Otra Vez (feat. J Balvin)', 3: "Vente Pa' Ca", 4: 'Safari', 5: 'La Bicicleta', 6: 'Ay Mi Dios', 7: 'Andas En Mi Cabeza', 8: 'Traicionera', 9: 'Shaky Shaky', 10: 'Vacaciones', 11: 'Dile Que Tu Me Quieres', 12: 'Let Me Love You', 13: 'DUELE EL CORAZON', 14: 'Chillax', 15: 'Borro Cassette', 16: 'One Dance', 17: 'Closer', …}
      - Example rows:
          ID (URL)   Track Name
          0          Reggaetón Lento (Bailemos)
          1          Chantaje
          2          Otra Vez (feat. J Balvin)
          0          NaN
          3          Vente Pa' Ca
          4          Safari
          3          NaN
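The dictionary-based fill can be sketched as follows. This is a minimal plain-Python illustration, assuming the dictionary maps each song ID to a track name observed on another row with the same ID; the rows and variable names are hypothetical, and `None` stands in for NaN.

```python
# Hypothetical rows: (song id, track name); None stands in for NaN.
rows = [
    (0, "Reggaetón Lento (Bailemos)"),
    (1, "Chantaje"),
    (0, None),
    (3, "Vente Pa' Ca"),
    (3, None),
]

# Build the id -> name dictionary from the rows where the name is known.
known_names = {song_id: name for song_id, name in rows if name is not None}

# Fill each missing name from the dictionary; a name stays None when the
# id was never seen together with a name.
filled = [
    (song_id, name if name is not None else known_names.get(song_id))
    for song_id, name in rows
]
print(filled)
```

The same lookup would apply to the Artist column, given an ID to artist dictionary.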
  • 13. Feature Engineering
      - Goals:
          - Prepare the proper input dataset, compatible with the machine learning algorithm requirements
          - Improve the performance of machine learning models
      - Topics: Feature Selection, Feature Extraction, Feature Transformation, Feature Importance, Categorical Encoding
  • 14. Feature Engineering: Grouping (Target - Relevance Rating)
      - Position: song's position on chart (1, 2, 3, …, 200)
      - Position values have been grouped in two different ways:
          1. Relevance Labels (Ranking) from 0 to 10
          2. Relevance Labels (Ranking) from 0 to 20
      - Grouping for the 0 to 10 labels:
          Position   1    2    3    4-5   6-10   11-20   21-35   36-55   56-80   81-130   131-200
          Ranking    10   9    8    7     6      5       4       3       2       1        0
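The 0 to 10 grouping above can be expressed as a bucket lookup. The sketch below is one possible implementation using the bucket boundaries from the slide; the function name is hypothetical, and the 0 to 20 variant is not shown because its boundaries are not given.

```python
from bisect import bisect_left

# Upper bound of each position bucket in the 0-10 grouping:
# 1, 2, 3, 4-5, 6-10, 11-20, 21-35, 36-55, 56-80, 81-130, 131-200.
BUCKET_UPPER_BOUNDS = [1, 2, 3, 5, 10, 20, 35, 55, 80, 130, 200]


def position_to_relevance(position):
    """Map a chart position (1..200) to a relevance label (10..0)."""
    if not 1 <= position <= 200:
        raise ValueError("chart positions run from 1 to 200")
    # Index of the first bucket whose upper bound is >= position.
    bucket = bisect_left(BUCKET_UPPER_BOUNDS, position)
    return 10 - bucket


print([position_to_relevance(p) for p in (1, 4, 10, 11, 200)])
```

Narrow buckets at the top of the chart give the model finer relevance distinctions where ranking quality matters most.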
  • 15. Feature Engineering: Categorical Encoding (Document Level Feature)
      - Track Name: song title, e.g. Reggaetón Lento (Bailemos), Chantaje, Otra Vez (feat. J Balvin), …, Let Her Go
      - 2 different approaches:
          - Hash Encoding: feature hashing maps each category in a categorical feature to an integer within a pre-determined range
          - doc2vec: a method to create a numeric representation of a document or sentence, regardless of its length
  • 16. Categorical Encoding: Hash Encoding
      - Feature Hashing, or "The Hashing Trick", is a fast and space-efficient way of vectorising features
      - Uses the category_encoders library (imported as ce):
          import category_encoders as ce
          title_encoder = ce.HashingEncoder(cols=['Track Name'], n_components=8)
          newds = title_encoder.fit_transform(ds2)
      - Main arguments:
          - cols: a list of columns to encode
          - n_components: how many bits to use to represent the feature (default is 8 bits)
          - hash_method: which hashing method to use (default is the "md5" algorithm)
      - https://contrib.scikit-learn.org/category_encoders/hashing.html
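To show the idea behind the hashing trick without depending on the library, here is a minimal standard-library sketch. It is an illustration of the technique, not the category_encoders implementation: it hashes the string with md5 and sets one of `n_components` columns; the function name is hypothetical.

```python
import hashlib


def hash_encode(value, n_components=8):
    """Map a string to a row of n_components columns via md5 (sketch)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    # Reduce the hash to a column index in a fixed, pre-determined range.
    bucket = int(digest, 16) % n_components
    row = [0] * n_components
    row[bucket] = 1
    return row


row = hash_encode("Chantaje")
print(row)
```

Because the output width is fixed, unseen titles need no vocabulary update, at the cost of collisions: with thousands of titles and 8 columns, many different songs share the same column.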
  • 17. Categorical Encoding: Doc2Vec
      - Adaptation of Word2Vec that adds another feature vector, named Paragraph ID
      - Uses the gensim library
      - Steps:
          1. Represent each sentence as a list of words (tokens)
          2. Create a new TaggedDocument instance (tokens, tag)
          3. Build the vocabulary
          4. Train the Doc2Vec model
      - Main parameters:
          - documents: iterable list of TaggedDocument elements
          - dm {1, 0}: defines the training algorithm; by default dm = 1, which is the Distributed Memory version of Paragraph Vector (PV-DM)
          - min_count: ignores all words with total frequency lower than this
          - vector_size: dimensionality of the feature vectors (100 by default)
      - [Figure labels: TaggedDocument, Trained Document Vectors]
      - https://radimrehurek.com/gensim/models/doc2vec.html
  • 18. Feature Engineering: Language Detection from the Song Titles
      - langdetect, guess_language-spirit:
          - Low accuracy (built for large text)
          - No limitation
      - TextBlob, Googletrans:
          - High accuracy
          - Limited access (API)
      - https://pypi.org/ https://textblob.readthedocs.io/en/dev/api_reference.html
  • 19. Feature Engineering: Categorical Encoding (Document Level Feature)
      - Artist: name of musician/singer or group
      - Leave One Out Encoding:
          - Uses the category_encoders library
          - It excludes the current row's target when calculating the mean target for a level
      - Example of encoded values:
          Artist           Artists (encoded)
          CNCO             78.12742
          Shakira          68.62432
          Zion & Lennox    61.62190
          …                …
          Passengers       167.15266
      - [Figure: worked Leave One Out example on a toy FEATURE/TARGET table with categories A, B and C; mean = 1.06]
      - https://contrib.scikit-learn.org/category_encoders/leaveoneout.html
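The leave-one-out rule can be sketched in plain Python. This is an illustration of the technique rather than the library code; the function name and toy data are hypothetical, and the global-mean fallback for categories that appear only once is an assumption of this sketch.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the OTHER rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            # Remove the current row's target before averaging.
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            # Assumption: a category seen once falls back to the global mean.
            encoded.append(global_mean)
    return encoded


encoded = leave_one_out_encode(["A", "B", "A", "B", "A"], [1.0, 2.0, 3.0, 4.0, 5.0])
print(encoded)
```

Excluding the row's own target reduces the leakage that plain target-mean encoding would introduce during training.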
  • 20. Feature Engineering: Extracting Date (Query Level Feature)
      - Date: chart date
          Date         Year   Month   Day   Weekday
          2017/01/01   2017   1       1     6
          2017/01/02   2017   1       2     0
          2017/01/03   2017   1       3     1
          …            …      …       …     …
          2018/01/09   2018   1       9     1
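The date split in the table can be reproduced with the standard library. The sketch assumes the `YYYY/MM/DD` layout shown on the slide and Monday = 0 weekday numbering, which matches the table (1st January 2017 was a Sunday, encoded as 6); the function name is hypothetical.

```python
from datetime import datetime


def date_features(date_string):
    """Split a 'YYYY/MM/DD' chart date into query-level features."""
    parsed = datetime.strptime(date_string, "%Y/%m/%d")
    return {
        "Year": parsed.year,
        "Month": parsed.month,
        "Day": parsed.day,
        "Weekday": parsed.weekday(),  # Monday = 0 ... Sunday = 6
    }


print(date_features("2017/01/01"))
```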
  • 21. Feature Engineering: Query
      - Region: country code
      - pandas.factorize() is used to obtain a numeric representation of an array when all that matters is identifying distinct values
          Region   query_ID
          ec       0
          fi       1
          cr       2
          …        …
          hn       53
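What factorisation does can be shown with a small plain-Python equivalent. This is a sketch of the behaviour (integer codes assigned in order of first appearance), not the pandas implementation; the sample region list is hypothetical.

```python
def factorize(values):
    """Return (codes, uniques): one integer code per value, by first appearance."""
    codes, uniques, seen = [], [], {}
    for value in values:
        if value not in seen:
            seen[value] = len(uniques)
            uniques.append(value)
        codes.append(seen[value])
    return codes, uniques


codes, uniques = factorize(["ec", "fi", "ec", "cr", "fi"])
print(codes, uniques)
```

The resulting code is only an identifier for the query group; its numeric order carries no meaning.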
  • 22. Feature Engineering: Final Dataset
  • 23. Agenda: Problem Statement, Data Preprocessing, Model Training, Results
  • 24. Model Training: XGBoost
      - XGBoost is an optimised distributed gradient boosting library designed to be highly efficient, flexible and portable
      - It implements machine learning algorithms under the Gradient Boosting framework
      - It is Open Source
      - It supports both pairwise and list-wise models
      - https://github.com/dmlc/xgboost
  • 25. Model Training: XGBoost
      1. Split the entire dataset into:
          - Training Set, used to build and train the model (80%)
          - Test Set, used to evaluate the model performance on unseen data (20%)
      2. Separate the Relevance Label, query_ID and training vectors into different components to create the XGBoost matrices
      DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed.
  • 26. Training and Test Set Creation
      Training set:
          import xgboost
          # Feature columns: everything except the label, the song ID and the query ID.
          training_data_set = training_set_data_frame[
              training_set_data_frame.columns.difference(['Ranking', 'ID', 'query_ID'])]
          training_query_id_column = training_set_data_frame['query_ID']
          # Group sizes are read as consecutive row blocks, so the rows must be
          # ordered by query_ID and the counts must follow that same order.
          training_query_groups = training_query_id_column.value_counts(sort=False)
          training_label_column = training_set_data_frame['Ranking']
          training_xgb_matrix = xgboost.DMatrix(training_data_set, label=training_label_column)
          training_xgb_matrix.set_group(training_query_groups)
      Test set:
          test_data_set = test_set_data_frame[
              test_set_data_frame.columns.difference(['Ranking', 'ID', 'query_ID'])]
          test_query_id_column = test_set_data_frame['query_ID']
          test_query_groups = test_query_id_column.value_counts(sort=False)
          test_label_column = test_set_data_frame['Ranking']
          test_xgb_matrix = xgboost.DMatrix(test_data_set, label=test_label_column)
          test_xgb_matrix.set_group(test_query_groups)
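Because the group sizes passed to `set_group` describe consecutive blocks of rows, it can help to derive them directly from the row order. The sketch below is one defensive way to do that in plain Python; the helper name and the sample IDs are hypothetical, and it assumes the rows have already been sorted by query ID.

```python
from itertools import groupby


def contiguous_group_sizes(query_ids):
    """Sizes of consecutive runs of equal query ids, in row order."""
    return [sum(1 for _ in run) for _, run in groupby(query_ids)]


# Hypothetical query_ID column, sorted so each query forms one block.
sorted_ids = sorted([2, 0, 1, 0, 2, 2])
sizes = contiguous_group_sizes(sorted_ids)
print(sizes)

# Sanity check: the sizes must add up to the number of rows.
assert sum(sizes) == len(sorted_ids)
```

If the rows are not sorted, the same query ID shows up as several runs, which is an easy way to spot an ordering problem before training.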
  • 27. Model Training: XGBoost. Train and test the model with the LambdaMART method:
      - The LambdaMART model uses gradient boosted decision trees with a cost function derived from LambdaRank to solve a ranking task
      - The model performs list-wise ranking, where Normalised Discounted Cumulative Gain (NDCG) is maximised
      - List-wise approaches look directly at the entire list of documents and try to come up with the optimal ordering for it
      - The evaluation measure is an average across the queries
  • 28. Model Training: LambdaMART
      Train and test the model with LambdaMART:
          params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@10', 'verbosity': 2}
          # Early stopping monitors the last entry of evals, so the test set goes last.
          watch_list = [(training_xgb_matrix, 'train'), (test_xgb_matrix, 'eval')]
          print('- - - - Training The Model')
          # early_stopping_rounds is an argument of xgboost.train, not a booster parameter.
          xgb_model = xgboost.train(params, training_xgb_matrix, num_boost_round=999,
                                    evals=watch_list, early_stopping_rounds=10)
          print('- - - - Saving XGBoost model')
          # output_dir and name are defined elsewhere in the script.
          xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
          xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True, dump_format='json')
  • 29. Evaluation Metric: List-wise and NDCG (Normalised Discounted Cumulative Gain)
      - DCG@K = Discounted Cumulative Gain@K. It measures the usefulness, or gain, of a document based on its position in the result list (relevance weight combined with result position)
      - NDCG@K = DCG@K / Ideal DCG@K
      - It will be in the range [0, 1]
      - Example (relevance label at each result position):
          Position   Model1   Model2   Model3   Ideal
          1          1        2        2        4
          2          2        3        4        3
          3          3        2        3        2
          4          4        4        2        2
          5          2        1        1        1
          6          0        0        0        0
          7          0        0        0        0
          DCG        14.01    15.76    17.64    22.60
          NDCG       0.62     0.70     0.78     1.00
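The numbers in the example can be recomputed with a short sketch. The slide does not state the gain formula; the exponential form, (2^rel - 1) / log2(position + 1), is used here because it reproduces the DCG and NDCG values shown. Function and variable names are hypothetical.

```python
from math import log2


def dcg(relevances, k=None):
    """Discounted Cumulative Gain with exponential gain, optionally cut at k."""
    top = relevances if k is None else relevances[:k]
    return sum((2 ** rel - 1) / log2(pos + 1) for pos, rel in enumerate(top, start=1))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels by result position, as in the example.
model1 = [1, 2, 3, 4, 2, 0, 0]
model2 = [2, 3, 2, 4, 1, 0, 0]
model3 = [2, 4, 3, 2, 1, 0, 0]

for name, ranking in (("Model1", model1), ("Model2", model2), ("Model3", model3)):
    print(name, round(dcg(ranking), 2), round(ndcg(ranking), 2))
```

Model3 scores highest because it places the most relevant songs closest to the top, where the logarithmic discount is smallest.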
  • 30. Common Mistakes: mistakes to avoid during model creation
      - One sample per query group
      - One Relevance Label for all the samples in a query group
      Under-sampled query IDs can potentially make your average NDCG skyrocket.
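Both mistakes can be detected before training with a simple check over the query groups. The sketch below is a hypothetical helper, not part of the original project: it flags queries with a single sample and queries whose samples all share one relevance label, since any ordering of such a group is already ideal and contributes nothing to learning while inflating the averaged metric.

```python
from collections import defaultdict


def find_degenerate_queries(query_ids, labels):
    """Return (single-sample query ids, single-label query ids)."""
    labels_by_query = defaultdict(list)
    for query_id, label in zip(query_ids, labels):
        labels_by_query[query_id].append(label)

    single_sample = sorted(
        q for q, group in labels_by_query.items() if len(group) == 1)
    single_label = sorted(
        q for q, group in labels_by_query.items()
        if len(group) > 1 and len(set(group)) == 1)
    return single_sample, single_label


# Hypothetical data: query 1 has one sample, query 2 has one distinct label.
single_sample, single_label = find_degenerate_queries(
    [0, 0, 1, 2, 2, 3, 3], [3, 1, 5, 2, 2, 0, 4])
print(single_sample, single_label)
```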
  • 31. Agenda: Problem Statement, Data Preprocessing, Model Training, Results
  • 32. Results
      NDCG@10, where '@10' denotes that the metric is evaluated only on the top 10 documents/songs.
          Encoding           Relevance Labels   train-ndcg@10   eval-ndcg@10
          Hash Encoding      0-10               0.7179          0.7351
          Hash Encoding      0-20               0.8018          0.7740
          doc2vec Encoding   0-10               0.8235          0.7633
          doc2vec Encoding   0-20               0.8215          0.8244
  • 33. Conclusions
      - Importance of Data Preprocessing and Feature Engineering
      - Language Detection as an additional feature
      - doc2vec and Relevance Rating [0, 20] as the best approaches
      - Online testing in LTR evaluation
      - Use of the Tree SHAP library for feature importance: https://github.com/slundberg/shap