London Information Retrieval Meetup
Evaluating Your Learning to Rank Model:
Dos and Don'ts in Offline/Online Evaluation
Alessandro Benedetti, Director
Anna Ruggero, R&D Software Engineer
23rd June 2020
London Information Retrieval Meetup
Who We Are
Alessandro Benedetti
! Born in Tarquinia (ancient Etruscan city)
! R&D Software Engineer
! Search Consultant
! Director
! Master in Computer Science
! Apache Lucene/Solr Committer
! Passionate about semantic, NLP and machine
learning technologies
! Beach Volleyball player and Snowboarder
London Information Retrieval Meetup
Who We Are

Anna Ruggero
! R&D Search Software Engineer
! Master Degree in Computer Science Engineering
! Big Data, Information Retrieval
! Organist, Music lover
London Information Retrieval Meetup
● Headquartered in London / distributed
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevancy Tuning
www.sease.io
Search Services
London Information Retrieval Meetup
Clients
London Information Retrieval Meetup
Overview
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
[Offline] A Business Perspective

Advantages:

! Find anomalies in the data, like weird feature
distributions or strange collected values, …
! Check how the model performs before using it in
production: implement improvements, fix bugs, tune
parameters, …
! Save time and money: putting a bad model in production
can worsen the user experience on the website.
London Information Retrieval Meetup
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
[Offline] XGBoost
XGBoost is an optimized distributed gradient boosting library
designed to be highly efficient, flexible and portable.
It implements machine learning algorithms under the Gradient
Boosting framework.
It is Open Source.
https://github.com/dmlc/xgboost
London Information Retrieval Meetup
[Offline] Build a Test Set

Relevance Label | QueryId | DocumentId | Feature1 | Feature2
3               | 1       | 1          | 3.0      | 2.0
2               | 1       | 2          | 0.0      | 1.0
4               | 2       | 2          | 3.0      | 2.5
1               | 2       | 1          | 9.0      | 4.0
0               | 3       | 2          | 8.0      | 4.0
2               | 3       | 1          | 3.0      | 1.0

Create a training set with XGBoost:

training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]

training_data_set:
Feature1 | Feature2
3.0      | 2.0
0.0      | 1.0
3.0      | 2.5
9.0      | 4.0
8.0      | 4.0
3.0      | 1.0
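A self-contained sketch of this step (plain string column names stand in for the features.* constants, which are assumptions here). Note that pandas' columns.difference returns the remaining columns sorted alphabetically, not in their original order:

import pandas as pd

# Toy training set mirroring the table above.
training_set_data_frame = pd.DataFrame({
    'relevance_label': [3, 2, 4, 1, 0, 2],
    'query_id': [1, 1, 2, 2, 3, 3],
    'document_id': [1, 2, 2, 1, 2, 1],
    'feature1': [3.0, 0.0, 3.0, 9.0, 8.0, 3.0],
    'feature2': [2.0, 1.0, 2.5, 4.0, 4.0, 1.0],
})

# Keep only the feature columns (difference() returns them sorted).
training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        ['relevance_label', 'document_id', 'query_id'])]
print(training_data_set)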
London Information Retrieval Meetup
[Offline] Build a Test Set

Relevance Label | QueryId | DocumentId | Feature1 | Feature2
3               | 1       | 1          | 3.0      | 2.0
2               | 1       | 2          | 0.0      | 1.0
4               | 2       | 2          | 3.0      | 2.5
1               | 2       | 1          | 9.0      | 4.0
0               | 3       | 2          | 8.0      | 4.0
2               | 3       | 1          | 3.0      | 1.0

Create the query Id groups:

training_query_id_column = training_set_data_frame[features.QUERY_ID]
training_query_groups = training_query_id_column.value_counts(sort=False)

training_query_id_column:
QueryId
1
1
2
2
3
3

training_query_groups:
QueryId | Count
1       | 2
2       | 2
3       | 2
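XGBoost's set_group expects one count per query, in the order the queries appear in the data. value_counts(sort=False) generally preserves that order; an equivalent that makes the ordering explicit (a sketch on the toy frame above) is:

# groupby(sort=False).size() returns group sizes in order of first
# appearance of each query id, which is what set_group expects.
training_query_groups = training_set_data_frame.groupby(
    'query_id', sort=False).size()
print(training_query_groups.tolist())  # [2, 2, 2]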
London Information Retrieval Meetup
[Offline] Build a Test Set

Relevance Label | QueryId | DocumentId | Feature1 | Feature2
3               | 1       | 1          | 3.0      | 2.0
2               | 1       | 2          | 0.0      | 1.0
4               | 2       | 2          | 3.0      | 2.5
1               | 2       | 1          | 9.0      | 4.0
0               | 3       | 2          | 8.0      | 4.0
2               | 3       | 1          | 3.0      | 1.0

Create the relevance label column:

training_label_column = training_set_data_frame[features.RELEVANCE_LABEL]

training_label_column:
Relevance Label
3
2
4
1
0
2
London Information Retrieval Meetup
[Offline] Build a Test Set

Create a training set with XGBoost:

training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]
training_query_id_column = training_set_data_frame[features.QUERY_ID]
training_query_groups = training_query_id_column.value_counts(sort=False)
training_label_column = training_set_data_frame[features.RELEVANCE_LABEL]

training_xgb_matrix = xgb.DMatrix(training_data_set, label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)
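Putting the pieces together on the toy frame above (a sketch; assumes the xgboost package is installed):

import xgboost as xgb

training_xgb_matrix = xgb.DMatrix(
    training_data_set,
    label=training_set_data_frame['relevance_label'])
# One entry per query, in order of appearance: [2, 2, 2].
training_xgb_matrix.set_group(
    training_set_data_frame.groupby('query_id', sort=False).size().tolist())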
London Information Retrieval Meetup
[Offline] Build a Test Set

Create a test set with XGBoost:

test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]
test_query_id_column = test_set_data_frame[features.QUERY_ID]
test_query_groups = test_query_id_column.value_counts(sort=False)
test_label_column = test_set_data_frame[features.RELEVANCE_LABEL]

test_xgb_matrix = xgb.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
London Information Retrieval Meetup
[Offline] Train/Test

Train and test the model with XGBoost:

params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@4', 'verbosity': 2}
watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]

print('- - - - Training The Model')
# early_stopping_rounds is a keyword argument of xgb.train, not a params entry.
xgb_model = xgb.train(params, training_xgb_matrix, num_boost_round=999,
                      evals=watch_list, early_stopping_rounds=10)

print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                     dump_format='json')
London Information Retrieval Meetup
[Offline] Save/Load Models

Save an XGBoost model:

logging.info('- - - - Saving XGBoost model')
xgboost_model_name = output_dir + "/xgboost-" + name
xgb_model.save_model(xgboost_model_name)

Load an XGBoost model:

logging.info('- - - - Loading XGBoost model')
xgb_model = xgb.Booster()
xgb_model.load_model(model_path)
London Information Retrieval Meetup
[Offline] Metrics

• precision = Ratio of relevant results among the search results returned
• precision@K = Ratio of relevant results among the top-K search results returned
• recall = Ratio of relevant results found among all the relevant results
• recall@K = Ratio of the relevant results found in the top-K among all the relevant results

What happens if:

precision@K ↑ means more <relevant results> in the top K
precision@K ↓ means fewer <relevant results> in the top K
recall@K ↑ means more <relevant results> found among all relevant
recall@K ↓ means fewer <relevant results> found among all relevant
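A small sketch of both top-K metrics on binary relevance labels (hypothetical helpers, not from the talk):

def precision_at_k(relevance, k):
    # Fraction of the top-k results that are relevant.
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k):
    # Fraction of all the relevant results that appear in the top-k.
    total_relevant = sum(relevance)
    return sum(relevance[:k]) / total_relevant if total_relevant else 0.0

ranked = [1, 0, 1, 1, 0, 1]        # 1 = relevant, 0 = not relevant
print(precision_at_k(ranked, 4))   # 0.75: 3 of the top 4 are relevant
print(recall_at_k(ranked, 4))      # 0.75: 3 of the 4 relevant are in the top 4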
London Information Retrieval Meetup
[Offline] NDCG

• DCG@K = Discounted Cumulative Gain@K

  DCG@K = Σ_{i=1..K} (2^rel_i − 1) / log2(i + 1)

  where rel_i is the relevance weight of the result at position i
  (one common formulation)

• NDCG@K = Normalised Discounted Cumulative Gain@K
  NDCG@K = DCG@K / Ideal DCG@K

Relevance labels per result position:

Model1 | Model2 | Model3 | Ideal
1      | 2      | 2      | 4
2      | 3      | 4      | 3
3      | 2      | 3      | 2
4      | 4      | 2      | 2
2      | 1      | 1      | 1
0      | 0      | 0      | 0
0      | 0      | 0      | 0

NDCG:
0.64   | 0.73   | 0.79   | 1.0

NDCG@K ↑ means more <relevant results> in better positions / with better relevance
NDCG@K ↓ means fewer <relevant results> in worse positions / with worse relevance
London Information Retrieval Meetup
[Offline] Test a Trained Model

Relevance Label | QueryId | DocumentId | Feature1 | Feature2
3               | 1       | 1          | 3.0      | 2.0
2               | 1       | 2          | 0.0      | 1.0
4               | 2       | 2          | 3.0      | 2.0
1               | 2       | 1          | 9.0      | 4.0
0               | 3       | 2          | 8.0      | 4.0
2               | 3       | 1          | 3.0      | 1.0

test_relevance_labels_per_queryId = [
    np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID])
    for query_id, data_frame in
    test_set_data_frame[[features.RELEVANCE_LABEL,
                         features.QUERY_ID]].groupby(features.QUERY_ID)]

data_frame (one per group):
QueryId | Relevance Labels
1       | [3, 2]
2       | [4, 1]
3       | [0, 2]

test_relevance_labels_per_queryId:
Relevance Labels
[3, 2]
[4, 1]
[0, 2]
London Information Retrieval Meetup
[Offline] Test a Trained Model

Relevance Label | QueryId | DocumentId | Feature1 | Feature2
3               | 1       | 1          | 3.0      | 2.0
2               | 1       | 2          | 0.0      | 1.0
4               | 2       | 2          | 3.0      | 2.0
1               | 2       | 1          | 9.0      | 4.0
0               | 3       | 2          | 8.0      | 4.0
2               | 3       | 1          | 3.0      | 1.0

test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference(
    [features.RELEVANCE_LABEL, features.DOCUMENT_ID])]

test_set_data_frame:
QueryId | Feature1 | Feature2
1       | 3.0      | 2.0
1       | 0.0      | 1.0
2       | 3.0      | 2.0
2       | 9.0      | 4.0
3       | 8.0      | 4.0
3       | 3.0      | 1.0
London Information Retrieval Meetup
[Offline] Test a Trained Model

test_data_per_queryId = [
    data_frame.loc[:, data_frame.columns != features.QUERY_ID]
    for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)]

test_set_data_frame:
QueryId | Feature1 | Feature2
1       | 3.0      | 2.0
1       | 0.0      | 1.0
2       | 3.0      | 2.0
2       | 9.0      | 4.0
3       | 8.0      | 4.0
3       | 3.0      | 1.0

data_frame (one per group):
QueryId | Feature1 | Feature2
1       | [3, 0]   | [2, 1]
2       | [3, 9]   | [2, 4]
3       | [8, 3]   | [4, 1]

test_data_per_queryId:
Feature1 | Feature2
[3, 0]   | [2, 1]
[3, 9]   | [2, 4]
[8, 3]   | [4, 1]
London Information Retrieval Meetup
[Offline] Test a Trained Model

Test an already trained XGBoost model.
Prepare the test set:

test_relevance_labels_per_queryId = [
    np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID])
    for query_id, data_frame in
    test_set_data_frame[[features.RELEVANCE_LABEL,
                         features.QUERY_ID]].groupby(features.QUERY_ID)]

# Flatten each (n, 1) label array to shape (n,).
test_relevance_labels_per_queryId = [
    test_relevance_labels.reshape(len(test_relevance_labels), )
    for test_relevance_labels in test_relevance_labels_per_queryId]

test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference(
    [features.RELEVANCE_LABEL, features.DOCUMENT_ID])]

test_data_per_queryId = [
    data_frame.loc[:, data_frame.columns != features.QUERY_ID]
    for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)]

test_xgb_matrix_list = [xgb.DMatrix(test_set) for test_set in test_data_per_queryId]
London Information Retrieval Meetup
[Offline] Test a Trained Model

Test an already trained XGBoost model:

predictions_with_relevance = []
logging.info('- - - - Making predictions')
predictions_list = [xgb_model.predict(test_xgb_matrix)
                    for test_xgb_matrix in test_xgb_matrix_list]

for predictions, labels in zip(predictions_list, test_relevance_labels_per_queryId):
    to_data_frame = [list(row) for row in zip(predictions, labels)]
    predictions_with_relevance.append(pd.DataFrame(
        to_data_frame, columns=['predicted_score', 'relevance_label']))

predictions_with_relevance = [
    predictions_per_query.sort_values(by='predicted_score', ascending=False)
    for predictions_per_query in predictions_with_relevance]

logging.info('- - - - NDCG computation')
ndcg_scores_list = []
for predictions_per_query in predictions_with_relevance:
    ndcg = ndcg_at_k(predictions_per_query['relevance_label'],
                     len(predictions_per_query))
    ndcg_scores_list.append(ndcg)

final_ndcg = statistics.mean(ndcg_scores_list)
logging.info('- - - - The final NDCG is: ' + str(final_ndcg))
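ndcg_at_k is not defined in the slides; a minimal sketch of what it could look like, assuming the exponential-gain, log2-discount formulation of DCG shown earlier:

import math

def dcg_at_k(relevance_labels, k):
    # Discounted sum of gains over the top-k results.
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(list(relevance_labels)[:k]))

def ndcg_at_k(relevance_labels, k):
    # Normalise by the DCG of the ideal (descending) ordering.
    ideal_dcg = dcg_at_k(sorted(relevance_labels, reverse=True), k)
    return dcg_at_k(relevance_labels, k) / ideal_dcg if ideal_dcg > 0 else 1.0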
London Information Retrieval Meetup
[Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

! One sample per query group:
! If we have a small number of interactions it could happen
during the split that we obtain some queries with just a
single sample.
In this case the NDCG@K for the query group will be 1
(independently of the model)! See the sketch after this slide.

Query1 (single sample, relevance label 1):
       Model1 | Model2 | Model3 | Ideal
label  1      | 1      | 1      | 1
DCG    1      | 1      | 1      | 1

Query2 (single sample, relevance label 3):
       Model1 | Model2 | Model3 | Ideal
label  3      | 3      | 3      | 3
DCG    7      | 7      | 7      | 7

With a single result per query, every model's DCG equals the ideal DCG, so NDCG is always 1.
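A tiny demo of the pitfall, reusing the ndcg_at_k sketch above: a single-document query group always scores a perfect NDCG, whatever the model predicted:

print(ndcg_at_k([3], 1))     # 1.0, regardless of the model
print(ndcg_at_k([0, 2], 2))  # ~0.63: below 1 only when there is something to reorder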
London Information Retrieval Meetup
[Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

! One sample per query group
! One relevance label for all the samples in a query group:
! During the split we could put all the samples with a single
relevance label in the test set. If every sample in a query group
shares the same label, any ordering is ideal, so the NDCG@K for
that group is again 1, independently of the model.
London Information Retrieval Meetup
[Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

! One sample per query group
! One relevance label for all the samples of a query group
! Samples considered for the data set creation:
! We have to be sure that we are using a realistic set of samples
for the test set creation.
These <query, document> pairs represent the possible user behaviour,
so they must have a balance of unknown/known queries with results
mixed in relevance. A sketch that screens for the first two pitfalls
follows this slide.
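A sketch (pandas, same hypothetical column names as before) that screens the test set for the first two pitfalls before evaluation:

# Drop query groups whose NDCG would trivially be 1: groups with a
# single sample or with a single distinct relevance label.
group_stats = test_set_data_frame.groupby('query_id')['relevance_label'] \
    .agg(['size', 'nunique'])
valid_queries = group_stats[
    (group_stats['size'] > 1) & (group_stats['nunique'] > 1)].index
test_set_data_frame = test_set_data_frame[
    test_set_data_frame['query_id'].isin(valid_queries)]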
London Information Retrieval Meetup
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
[Online] A Business Perspective

There are several problems that are hard to detect
with an offline evaluation:

► An incorrect or imperfect test set yields model
evaluation results that don't reflect the real model
improvements/regressions.
► We may get an extremely high evaluation metric
offline, but only because we improperly designed the
test: the model is unfortunately not a good fit.
London Information Retrieval Meetup
[Online] A Business Perspective

There are several problems that are hard to detect
with an offline evaluation:

► An incorrect or imperfect test set yields model
evaluation results that don't reflect the real model
improvements/regressions.
► It is hard to find a direct correlation between the offline
evaluation metrics and the parameters used for the online model
performance evaluation (e.g. revenues, click-through rate, …).
London Information Retrieval Meetup
[Online] A Business Perspective

There are several problems that are hard to detect
with an offline evaluation:

► An incorrect or imperfect test set yields model
evaluation results that don't reflect the real model
improvements/regressions.
► It is hard to find a direct correlation between the offline
evaluation metrics and the parameters used for the online model
performance evaluation (e.g. revenues, click-through rate, …).
► Offline evaluation is based on generated relevance labels
that don't always reflect the real user need.
London Information Retrieval Meetup
[Online] Business Advantages

Using online testing brings many advantages:

► The reliability of the results: we directly observe the user
behaviour.
► The interpretability of the results: we directly observe the
impact of the model in terms of the online metrics the business
cares about.
► The possibility to observe the model's behaviour: we can see
how users interact with the model and figure out how to
improve it.
London Information Retrieval Meetup
[Online] Signals to Measure

! Click-through rates (views, downloads, add to cart, …)
! Sale/revenue rates
! Dwell time (time spent on a search result after the click)
! Query reformulations / bounce rates
! …

Recommendation: test for direct correlation!
When training the model, we probably chose one
objective to optimise (there are also multi-objective
learning to rank models).
London Information Retrieval Meetup
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
[Online] A/B Testing

[Diagram] Traffic is split 50% / 50% between the two models:
A (control) measures 20% on the chosen metric,
B (variation) measures 40%. A sketch of the significance test follows.
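A sketch of how the two variants could be compared for statistical significance: a standard two-proportion z-test (the counts below are made up; assumes scipy is available):

from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (successes_b / n_b - successes_a / n_a) / standard_error
    return z, 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

z, p_value = two_proportion_z_test(200, 1000, 400, 1000)
print(z, p_value)  # 20% vs 40% conversion: a highly significant difference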
London Information Retrieval Meetup
[Online] A/B Testing Noise

Extra care is needed when implementing A/B testing.

► Be sure to consider only interactions from result pages
ranked by the models you are comparing,
i.e. do not count every click, sale or download happening
on the site.
London Information Retrieval Meetup
[Online] A/B Testing Noise 1

Extra care is needed when implementing A/B testing.
► Be sure to consider only interactions from result pages
ranked by the models you are comparing.

► Suppose we are analysing model A. We obtain:
10 sales from the homepage and
5 sales from the search page.
► Suppose we are analysing model B. We obtain:
4 sales from the homepage and
10 sales from the search page.

Model A is better than Model B(?)
Counting every sale, Model A looks better (15 vs 14), but only the
search-page sales come from the models under test: there Model B
clearly wins (10 vs 5).
London Information Retrieval Meetup
[Online] A/B Testing Noise 2

Extra care is needed when implementing A/B testing.
► Be sure to consider only interactions from result pages
ranked by the models you are comparing.

► Suppose we are analysing model A. We obtain:
12 sales from the homepage and
11 sales from the search page.
► Suppose we are analysing model B. We obtain:
5 sales from the homepage and
10 sales from the search page.

Model A is better than Model B(?)
Counting every sale, Model A wins comfortably (23 vs 15), but on the
search result pages the models actually ranked, the difference is
marginal (11 vs 10).
London Information Retrieval Meetup
Offline Testing for Business
Build a Test Set
Online Testing for Business
A/B Testing
Interleaving
London Information Retrieval Meetup
[Online] Interleaving Advantages

► It reduces the variance problem caused by splitting users
into separate groups (group A and group B).
► It is more sensitive when comparing models.
► It requires less traffic.
► It requires less time to achieve reliable results.
► It doesn't necessarily expose a bad model to a
subpopulation of users.
London Information Retrieval Meetup
[Online] Interleaving

[Diagram] 100% of the users see a single result list that
interleaves Model A's ranking and Model B's ranking
position by position.
London Information Retrieval Meetup
[Online] Balanced Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.
London Information Retrieval Meetup
[Online] Balanced Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.

DRAWBACK
► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► The comparison phase makes Model B win more often
than Model A. This happens regardless of the model chosen
as prior.
► This drawback arises due to:
► the way in which the evaluation of the results is done;
► the fact that Model B ranks every document except a
higher than Model A does.
London Information Retrieval Meetup
[Online] Team-Draft Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.
► Team-Draft Interleaving: the method of team captains
picking players in team matches (a sketch follows).
https://issues.apache.org/jira/browse/SOLR-14560
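A minimal sketch of team-draft interleaving (coin flip per round to decide which "captain" picks first; assumes both rankings contain the same documents):

import random

def team_draft_interleave(ranking_a, ranking_b):
    interleaved, teams = [], {'A': set(), 'B': set()}
    while len(interleaved) < len(ranking_a):
        # Coin flip: which model picks first in this round.
        first, second = random.sample(['A', 'B'], 2)
        for model in (first, second):
            ranking = ranking_a if model == 'A' else ranking_b
            # Each captain picks its highest-ranked document not yet taken.
            pick = next((d for d in ranking if d not in interleaved), None)
            if pick is not None and len(interleaved) < len(ranking_a):
                interleaved.append(pick)
                teams[model].add(pick)
    return interleaved, teams

# Clicks on a document credit the team that contributed it.
print(team_draft_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a']))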
London Information Retrieval Meetup
[Online] Team-Draft Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.
► Team-Draft Interleaving: the method of team captains
in team matches.

DRAWBACK
► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► Suppose c is the only relevant document.
► With this approach we can obtain four different interleaved
lists:
► lI1 = (aA, bB, cA, dB)
► lI2 = (bB, aA, cB, dA)
► lI3 = (bB, aA, cA, dB)
► lI4 = (aA, bB, cB, dA)
► All of them put c at the same rank, so clicks on c credit
both teams equally often: a tie!
But Model B, which ranks c higher, should be chosen as the
best model.
London Information Retrieval Meetup
[Online] Probabilistic Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.
► Team-Draft Interleaving: the method of team captains
in team matches.
► Probabilistic Interleaving: relies on probability
distributions. Every document has a non-zero
probability of being added to the interleaved result list.
London Information Retrieval Meetup
[Online] Probabilistic Interleaving

There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one
model having the priority.
► Team-Draft Interleaving: the method of team captains
in team matches.
► Probabilistic Interleaving: relies on probability
distributions. Every document has a non-zero
probability of being added to the interleaved result list.

DRAWBACK
The use of probability distributions could lead to a worse user
experience: less relevant documents could be placed higher.
A sketch of the sampling step follows.
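A sketch of the sampling step (probability proportional to 1/rank^tau with tau = 3, as in the probabilistic-interleave literature; an illustrative assumption, not code from the talk):

import random

TAU = 3  # Higher values concentrate probability mass on the top ranks.

def pick_from(ranking, taken):
    # p(d) proportional to 1 / rank^TAU over the documents still available.
    candidates = [(d, 1.0 / ((i + 1) ** TAU))
                  for i, d in enumerate(ranking) if d not in taken]
    docs, weights = zip(*candidates)
    return random.choices(docs, weights=weights)[0]

def probabilistic_interleave(ranking_a, ranking_b):
    interleaved = []
    while len(interleaved) < len(ranking_a):
        # Pick a model uniformly, then sample one of its remaining documents.
        ranking = random.choice([ranking_a, ranking_b])
        interleaved.append(pick_from(ranking, interleaved))
    return interleaved

print(probabilistic_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a']))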
London Information Retrieval Meetup
Conclusions

► Both offline and online Learning to Rank evaluations are vital
for a business.

► Offline
- doesn't affect production
- allows research and experimentation of wild ideas
- reduces the number of online experiments to run

► Online
- measures improvements/regressions with real users
- isolates the benefits coming from the Learning to Rank models
London Information Retrieval Meetup
Thanks!

More Related Content

What's hot

Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Lucidworks
 
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...OpenSource Connections
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
Multi-Task Learning for NLP
Multi-Task Learning for NLPMulti-Task Learning for NLP
Multi-Task Learning for NLPMotoki Sato
 
How Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionHow Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionEugene Yan Ziyou
 
Learning to rank
Learning to rankLearning to rank
Learning to rankBruce Kuo
 
Activity Ranking in LinkedIn Feed
Activity Ranking in LinkedIn FeedActivity Ranking in LinkedIn Feed
Activity Ranking in LinkedIn FeedBodla Kumar
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extractionGabriel Hamilton
 
Building a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and OntologiesBuilding a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and OntologiesNeo4j
 
Purely Functional Data Structures in Scala
Purely Functional Data Structures in ScalaPurely Functional Data Structures in Scala
Purely Functional Data Structures in ScalaVladimir Kostyukov
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Edureka!
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSematext Group, Inc.
 
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...Qi Guo
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 
SPARQL 사용법
SPARQL 사용법SPARQL 사용법
SPARQL 사용법홍수 허
 
Peeking inside the engine of ZIO SQL.pdf
Peeking inside the engine of ZIO SQL.pdfPeeking inside the engine of ZIO SQL.pdf
Peeking inside the engine of ZIO SQL.pdfJaroslavRegec1
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 

What's hot (20)

Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
 
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
Multi-Task Learning for NLP
Multi-Task Learning for NLPMulti-Task Learning for NLP
Multi-Task Learning for NLP
 
How Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionHow Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversion
 
Learning to rank
Learning to rankLearning to rank
Learning to rank
 
Activity Ranking in LinkedIn Feed
Activity Ranking in LinkedIn FeedActivity Ranking in LinkedIn Feed
Activity Ranking in LinkedIn Feed
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
Building a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and OntologiesBuilding a Knowledge Graph using NLP and Ontologies
Building a Knowledge Graph using NLP and Ontologies
 
Purely Functional Data Structures in Scala
Purely Functional Data Structures in ScalaPurely Functional Data Structures in Scala
Purely Functional Data Structures in Scala
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for You
 
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
Talent Search and Recommendation Systems at LinkedIn: Practical Challenges an...
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Solr formation Sparna
Solr formation SparnaSolr formation Sparna
Solr formation Sparna
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
SPARQL 사용법
SPARQL 사용법SPARQL 사용법
SPARQL 사용법
 
Peeking inside the engine of ZIO SQL.pdf
Peeking inside the engine of ZIO SQL.pdfPeeking inside the engine of ZIO SQL.pdf
Peeking inside the engine of ZIO SQL.pdf
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 

Similar to Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation

Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Building a real time big data analytics platform with solr
Building a real time big data analytics platform with solrBuilding a real time big data analytics platform with solr
Building a real time big data analytics platform with solrTrey Grainger
 
Building a real time, big data analytics platform with solr
Building a real time, big data analytics platform with solrBuilding a real time, big data analytics platform with solr
Building a real time, big data analytics platform with solrlucenerevolution
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
 
WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...Fabio Franzini
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Djangokenluck2001
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackSease
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
gDayX 2013 - Advanced AngularJS - Nicolas Embleton
gDayX 2013 - Advanced AngularJS - Nicolas EmbletongDayX 2013 - Advanced AngularJS - Nicolas Embleton
gDayX 2013 - Advanced AngularJS - Nicolas EmbletonGeorge Nguyen
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享Chengjen Lee
 

Similar to Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation (20)

Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Building a real time big data analytics platform with solr
Building a real time big data analytics platform with solrBuilding a real time big data analytics platform with solr
Building a real time big data analytics platform with solr
 
Building a real time, big data analytics platform with solr
Building a real time, big data analytics platform with solrBuilding a real time, big data analytics platform with solr
Building a real time, big data analytics platform with solr
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
Ember
EmberEmber
Ember
 
WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Data access
Data accessData access
Data access
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Django
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - Haystack
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
MongoDB and Python
MongoDB and PythonMongoDB and Python
MongoDB and Python
 
gDayX 2013 - Advanced AngularJS - Nicolas Embleton
gDayX 2013 - Advanced AngularJS - Nicolas EmbletongDayX 2013 - Advanced AngularJS - Nicolas Embleton
gDayX 2013 - Advanced AngularJS - Nicolas Embleton
 
Learning with F#
Learning with F#Learning with F#
Learning with F#
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
 

More from Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 

More from Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 

Recently uploaded

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...

Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation

  • 14. London Information Retrieval Meetup [Offline] Build a Test Set
Create the training set with XGBoost (recap of the previous steps, then the DMatrix creation):
training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]
training_query_id_column = training_set_data_frame[features.QUERY_ID]
training_query_groups = training_query_id_column.value_counts(sort=False)
training_label_column = training_set_data_frame[features.RELEVANCE_LABEL]
training_xgb_matrix = xgb.DMatrix(training_data_set, label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)
  • 15. London Information Retrieval Meetup [Offline] Build a Test Set
Create the test set with XGBoost, mirroring the training set:
test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]
test_query_id_column = test_set_data_frame[features.QUERY_ID]
test_query_groups = test_query_id_column.value_counts(sort=False)
test_label_column = test_set_data_frame[features.RELEVANCE_LABEL]
test_xgb_matrix = xgb.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
  • 16. London Information Retrieval Meetup [Offline] Train/Test
Train and test the model with XGBoost (note: early_stopping_rounds is an argument of xgb.train, not a key of the params dictionary):
params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@4', 'verbosity': 2}
watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]
print('- - - - Training The Model')
xgb_model = xgb.train(params, training_xgb_matrix, num_boost_round=999,
                      evals=watch_list, early_stopping_rounds=10)
print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True, dump_format='json')
  • 17. London Information Retrieval Meetup [Offline] Save/Load Models
Save an XGBoost model:
logging.info('- - - - Saving XGBoost model')
xgboost_model_name = output_dir + "/xgboost-" + name
xgb_model.save_model(xgboost_model_name)
Load an XGBoost model:
logging.info('- - - - Loading XGBoost model')
xgb_model = xgb.Booster()
xgb_model.load_model(model_path)
  • 18. London Information Retrieval Meetup [Offline] Metrics
• precision = ratio of relevant results among all the search results returned
• precision@k = ratio of relevant results among the top-k search results returned
• recall = ratio of the relevant results found among all the relevant results
• recall@k = ratio of the relevant results found in the top-k among all the relevant results
What happens when a metric moves?
precision@k up means more relevant results in the top k
precision@k down means fewer relevant results in the top k
recall@k up means more relevant results found among all the relevant ones
recall@k down means fewer relevant results found among all the relevant ones
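To make the definitions above concrete, here is a minimal Python sketch (not from the original deck) computing precision@k and recall@k from binary relevance labels ordered by the model's predicted score:

def precision_at_k(ranked_relevance, k):
    # ranked_relevance: 0/1 labels, ordered by predicted score (descending)
    return sum(ranked_relevance[:k]) / k

def recall_at_k(ranked_relevance, k):
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:k]) / total_relevant

# Example: 2 of the top-4 results are relevant, out of 3 relevant overall
ranked = [1, 0, 1, 0, 1, 0]
print(precision_at_k(ranked, 4))  # 0.5
print(recall_at_k(ranked, 4))     # 0.666...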
  • 19. London Information Retrieval Meetup [Offline] NDCG
• DCG@k = Discounted Cumulative Gain@k: each result contributes a gain that grows with its relevance label and is discounted by its position. A common formulation is DCG@k = sum over i = 1..k of (2^rel_i - 1) / log2(i + 1).
• NDCG@k = Normalised Discounted Cumulative Gain@k = DCG@k / Ideal DCG@k
NDCG@k up means more relevant results in better positions
NDCG@k down means relevant results in worse positions, with worse relevance
Example relevance labels per position, with the resulting NDCG:
Model1: 1, 2, 3, 4, 2, 0, 0 -> 0.64
Model2: 2, 3, 2, 4, 1, 0, 0 -> 0.73
Model3: 2, 4, 3, 2, 1, 0, 0 -> 0.79
Ideal:  4, 3, 2, 2, 1, 0, 0 -> 1.0
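The deck later calls an ndcg_at_k helper when testing a trained model, but its implementation is not shown; here is a hedged sketch assuming the exponential-gain formulation above (the deck's exact gain/discount variant may differ slightly):

import math

def dcg_at_k(relevance_labels, k):
    # Exponential gain, logarithmic position discount:
    # (2^rel - 1) / log2(position + 1), positions starting at 1
    labels = list(relevance_labels)[:k]
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg_at_k(relevance_labels, k):
    # relevance_labels must be ordered by descending predicted score
    ideal = dcg_at_k(sorted(relevance_labels, reverse=True), k)
    if ideal == 0:
        return 1.0  # degenerate query group: see the common mistakes below
    return dcg_at_k(relevance_labels, k) / ideal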
  • 20. London Information Retrieval Meetup [Offline] Test a Trained Model
Group the relevance labels by query id:
test_relevance_labels_per_queryId = [
    np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID])
    for query_id, data_frame in test_set_data_frame[
        [features.RELEVANCE_LABEL, features.QUERY_ID]].groupby(features.QUERY_ID)]
The result is one array of relevance labels per query: [3, 2] for query 1, [4, 1] for query 2, [0, 2] for query 3.
  • 21. London Information Retrieval Meetup [Offline] Test a Trained Model
Drop the relevance label and document id columns, keeping the query id and the features:
test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference(
    [features.RELEVANCE_LABEL, features.DOCUMENT_ID])]
  • 22. London Information Retrieval Meetup [Offline] Test a Trained Model
Group the feature rows by query id:
test_data_per_queryId = [
    data_frame.loc[:, data_frame.columns != features.QUERY_ID]
    for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)]
The result is one feature matrix per query: rows (3.0, 2.0) and (0.0, 1.0) for query 1, (3.0, 2.0) and (9.0, 4.0) for query 2, (8.0, 4.0) and (3.0, 1.0) for query 3.
  • 23. London Information Retrieval Meetup [Offline] Test a Trained Model
Test an already trained XGBoost model. Prepare the test set:
test_relevance_labels_per_queryId = [
    np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID])
    for query_id, data_frame in test_set_data_frame[
        [features.RELEVANCE_LABEL, features.QUERY_ID]].groupby(features.QUERY_ID)]
test_relevance_labels_per_queryId = [
    test_relevance_labels.reshape(len(test_relevance_labels),)
    for test_relevance_labels in test_relevance_labels_per_queryId]
test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference(
    [features.RELEVANCE_LABEL, features.DOCUMENT_ID])]
test_data_per_queryId = [
    data_frame.loc[:, data_frame.columns != features.QUERY_ID]
    for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)]
test_xgb_matrix_list = [xgb.DMatrix(test_set) for test_set in test_data_per_queryId]
  • 24. London Information Retrieval Meetup [Offline] Test a Trained Model
Test an already trained XGBoost model (note: the column names must be consistent between the data frame creation, the sort and the NDCG computation):
predictions_with_relevance = []
logging.info('- - - - Making predictions')
predictions_list = [xgb_model.predict(test_xgb_matrix)
                    for test_xgb_matrix in test_xgb_matrix_list]
for predictions, labels in zip(predictions_list, test_relevance_labels_per_queryId):
    to_data_frame = [list(row) for row in zip(predictions, labels)]
    predictions_with_relevance.append(pd.DataFrame(
        to_data_frame, columns=['predicted_score', 'relevance_label']))
predictions_with_relevance = [
    predictions_per_query.sort_values(by='predicted_score', ascending=False)
    for predictions_per_query in predictions_with_relevance]
logging.info('- - - - Ndcg computation')
ndcg_scores_list = []
for predictions_per_query in predictions_with_relevance:
    ndcg = ndcg_at_k(predictions_per_query['relevance_label'], len(predictions_per_query))
    ndcg_scores_list.append(ndcg)
final_ndcg = statistics.mean(ndcg_scores_list)
logging.info('- - - - The final ndcg is: ' + str(final_ndcg))
  • 25. London Information Retrieval Meetup [Offline] Common Mistakes
Let's see the common mistakes to avoid during the test set creation:
! One sample per query group:
! If we have a small number of interactions, the split may produce queries with just a single sample. In that case the NDCG@k for the query group is always 1, independently of the model: with one result, every model trivially returns the ideal ranking.
Example (single result per query): DCG is 3 on Query1 and 7 on Query2 for Model1, Model2, Model3 and the Ideal ranking alike, so NDCG is 1 everywhere.
  • 26. London Information Retrieval Meetup [Offline] Common Mistakes
Let's see the common mistakes to avoid during the test set creation:
! One sample per query group
! One relevance label for all the samples in a query group:
! During the split we could put all the samples that share a single relevance label in the test set. Again every ordering is trivially ideal and NDCG@k is 1. A defensive filter for both pitfalls is sketched below.
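A hypothetical pandas filter for both pitfalls (drop_degenerate_query_groups is our name, not the deck's): it removes query groups with a single sample and groups whose samples all share one relevance label.

def drop_degenerate_query_groups(data_frame, query_id_col, label_col):
    # NDCG is trivially 1 for groups with one sample or one distinct label,
    # so keeping them only inflates the offline metric
    grouped = data_frame.groupby(query_id_col)
    return grouped.filter(
        lambda group: len(group) > 1 and group[label_col].nunique() > 1)

test_set_data_frame = drop_degenerate_query_groups(
    test_set_data_frame, features.QUERY_ID, features.RELEVANCE_LABEL)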
  • 27. London Information Retrieval Meetup [Offline] Common Mistakes
Let's see the common mistakes to avoid during the test set creation:
! One sample per query group
! One relevance label for all the samples of a query group
! Samples considered for the data set creation:
! We have to be sure that we are using a realistic set of samples for the test set creation. These <query, document> pairs represent the possible user behaviour, so they must contain a balance of unknown/known queries with results of mixed relevance.
  • 28. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  • 29. London Information Retrieval Meetup [Online] A Business Perspective
There are several problems that are hard to detect with an offline evaluation:
► An incorrect or imperfect test set yields model evaluation results that don't reflect the real model improvements/regressions.
► We may get an extremely high evaluation metric offline, yet the model may not be a good fit, simply because we designed the test improperly.
  • 30. London Information Retrieval Meetup [Online] A Business Perspective
There are several problems that are hard to detect with an offline evaluation:
► An incorrect or imperfect test set yields model evaluation results that don't reflect the real model improvements/regressions.
► Finding a direct correlation between the offline evaluation metrics and the metrics used for the online model performance evaluation (e.g. revenues, click-through rate, ...) is hard.
  • 31. London Information Retrieval Meetup [Online] A Business Perspective
There are several problems that are hard to detect with an offline evaluation:
► An incorrect or imperfect test set yields model evaluation results that don't reflect the real model improvements/regressions.
► Finding a direct correlation between the offline evaluation metrics and the metrics used for the online model performance evaluation (e.g. revenues, click-through rate, ...) is hard.
► Offline evaluation is based on generated relevance labels that don't always reflect the real user need.
  • 32. London Information Retrieval Meetup [Online] Business Advantages
Online testing brings many advantages:
► Reliability of the results: we directly observe the user behaviour.
► Interpretability of the results: we directly observe the impact of the model in terms of the online metrics the business cares about.
► The possibility to observe the model behaviour: we can see how users interact with the model and figure out how to improve it.
  • 33. London Information Retrieval Meetup [Online] Signals to Measure
! Click-through rates (views, downloads, add to cart, ...)
! Sale/revenue rates
! Dwell time (time spent on a search result after the click)
! Query reformulations / bounce rates
! ...
When training the model we probably chose one objective to optimise (there are also multi-objective learning to rank models).
Recommendation: test for a direct correlation between that objective and the online signal! A minimal sketch of one such signal follows.
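For instance, a minimal sketch (assuming a hypothetical event log with one row per search-result impression and a boolean clicked column) computing the click-through rate per ranking model:

import pandas as pd

# Hypothetical impression log: one row per result shown, with a click flag
events = pd.DataFrame({
    'model':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'clicked': [True, False, False, True, True, False],
})
ctr_per_model = events.groupby('model')['clicked'].mean()
print(ctr_per_model)  # A: 0.333..., B: 0.666...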
  • 34. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  • 35. London Information Retrieval Meetup [Online] A/B Testing
Traffic is split 50%/50% between the two models: group A (Control) and group B (Variation). In the pictured example the Control converts at 20% and the Variation at 40%.
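Before trusting a gap like 20% vs 40%, it is worth checking statistical significance. A sketch with hypothetical counts (not from the deck), using a chi-squared test on the 2x2 conversion table as one common choice:

from scipy.stats import chi2_contingency

#                   converted, not converted
control_counts   = [200, 800]   # variant A: 20% conversion rate
variation_counts = [400, 600]   # variant B: 40% conversion rate
chi2, p_value, dof, expected = chi2_contingency([control_counts, variation_counts])
print('p-value:', p_value)  # below 0.05 -> the gap is unlikely to be chance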
  • 36. London Information Retrieval Meetup [Online] A/B Testing Noise
Extra care is needed when implementing A/B testing.
► Be sure to consider only interactions from result pages ranked by the models you are comparing, i.e. don't count every click, sale or download happening on the site.
  • 37. London Information Retrieval Meetup [Online] A/B Testing Noise 1
Extra care is needed when implementing A/B testing.
► Be sure to consider only interactions from result pages ranked by the models you are comparing.
► Suppose we are analysing model A. We obtain: 10 sales from the homepage and 5 sales from the search page.
► Suppose we are analysing model B. We obtain: 4 sales from the homepage and 10 sales from the search page.
Is model A better than model B? No: counting all sales, A wins (15 vs 14), but on the search page, the only page the models actually ranked, B wins (10 vs 5).
  • 39. London Information Retrieval Meetup [Online] A/B Testing Noise 2
Extra care is needed when implementing A/B testing.
► Be sure to consider only interactions from result pages ranked by the models you are comparing.
► Suppose we are analysing model B. We obtain: 5 sales from the homepage and 10 sales from the search page.
► Suppose we are analysing model A. We obtain: 12 sales from the homepage and 11 sales from the search page.
Is model A clearly better than model B? No: the totals (23 vs 15) suggest a big win for A, but on the search page the two models are nearly tied (11 vs 10); the gap comes from homepage sales the models had nothing to do with. A sketch of the page filter follows.
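A sketch of the page filter, assuming a hypothetical sales log with a page column recording where each sale originated: only search-page sales feed the per-model comparison.

import pandas as pd

# Hypothetical sales log: 'page' records where the sale originated
sales_events = pd.DataFrame({
    'model': ['A', 'A', 'A', 'B', 'B'],
    'page':  ['homepage', 'search', 'homepage', 'search', 'search'],
})
# Drop homepage noise: the models only ranked the search result page
search_sales = sales_events[sales_events['page'] == 'search']
print(search_sales.groupby('model').size())  # A: 1, B: 2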
  • 41. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  • 42. London Information Retrieval Meetup [Online] Interleaving Advantages
► It reduces the variance problem caused by splitting users into separate groups (group A and group B).
► It is more sensitive when comparing models.
► It requires less traffic.
► It requires less time to achieve reliable results.
► It doesn't necessarily expose a bad model to a sub-population of users.
  • 43. London Information Retrieval Meetup [Online] Interleaving
100% of the users see a single result list, built by merging the rankings of Model A and Model B: in the pictured example, the results ranked 1, 2, 3 by each model are alternated into one interleaved list.
  • 44. London Information Retrieval Meetup [Online] Balanced Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion, with one model having the priority (see the sketch below).
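A minimal sketch of the merge step (following the classic formulation; in practice the priority model is typically drawn at random per query): each step appends the candidate from the list that is currently "behind", breaking ties with the priority.

def balanced_interleave(ranking_a, ranking_b, a_has_priority):
    interleaved = []
    ka = kb = 0  # index of the highest-ranked document not yet interleaved

    def advance(ranking, k):
        # Skip documents already placed in the interleaved list
        while k < len(ranking) and ranking[k] in interleaved:
            k += 1
        return k

    while True:
        ka, kb = advance(ranking_a, ka), advance(ranking_b, kb)
        if ka >= len(ranking_a) or kb >= len(ranking_b):
            break
        if ka < kb or (ka == kb and a_has_priority):
            interleaved.append(ranking_a[ka])
        else:
            interleaved.append(ranking_b[kb])
    return interleaved

# With lA = (a, b, c, d) and lB = (b, c, d, a):
print(balanced_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a'], True))
# -> ['a', 'b', 'c', 'd'] with priority A, ['b', 'a', 'c', 'd'] with priority B;
# the bias in the click-attribution phase is the drawback on the next slide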
  • 45. London Information Retrieval Meetup [Online] Balanced Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one model having the priority.
DRAWBACK
► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► the comparison phase will lead Model B to win more often than Model A, regardless of the model chosen as prior.
► This drawback arises from:
► the way the evaluation of the results is done;
► the fact that Model B ranks every document except a higher than Model A does.
  • 46. London Information Retrieval Meetup [Online] Team-Draft Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one model having the priority.
► Team-Draft Interleaving: the method of team captains in team matches (see the sketch below).
https://issues.apache.org/jira/browse/SOLR-14560
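A minimal sketch of the idea (the Solr implementation in SOLR-14560 is more elaborate): in each round a coin flip decides which "captain" picks first, and each model adds its highest-ranked document not yet in the interleaved list, remembering which team contributed it.

import random

def team_draft_interleave(ranking_a, ranking_b):
    interleaved, teams = [], []
    chosen = set()
    all_docs = set(ranking_a) | set(ranking_b)

    def pick(ranking, team):
        # Take the model's highest-ranked document not yet selected
        for doc in ranking:
            if doc not in chosen:
                chosen.add(doc)
                interleaved.append(doc)
                teams.append(team)
                return

    while len(chosen) < len(all_docs):
        # Coin flip decides which captain picks first in this round
        order = [('A', ranking_a), ('B', ranking_b)]
        random.shuffle(order)
        for team, ranking in order:
            pick(ranking, team)
    return interleaved, teams

# Clicks are then credited to the team that contributed the clicked document
docs, teams = team_draft_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a'])
print(list(zip(docs, teams)))  # e.g. [('a','A'), ('b','B'), ('c','A'), ('d','B')]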
  • 47. London Information Retrieval Meetup [Online] Team-Draft Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one model having the priority.
► Team-Draft Interleaving: the method of team captains in team matches.
DRAWBACK
► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► Suppose c is the only relevant document.
► With this approach we can obtain four different interleaved lists:
► lI1 = (aA, bB, cA, dB)
► lI2 = (bB, aA, cB, dA)
► lI3 = (bB, aA, cA, dB)
► lI4 = (aA, bB, cB, dA)
► All of them put c at the same rank, so every click on c produces a tie, even though Model B (which ranks c higher) should be chosen as the best model!
  • 48. London Information Retrieval Meetup [Online] Probabilistic Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one model having the priority.
► Team-Draft Interleaving: the method of team captains in team matches.
► Probabilistic Interleaving: relies on probability distributions. Every document has a non-zero probability of being added to the interleaved result list, with higher-ranked documents getting a larger one (see the sketch below).
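A simplified sketch of the idea (the published method also tracks which model proposed each pick for credit assignment; tau is a hypothetical temperature parameter controlling how strongly the distribution favours top ranks):

import random

def probabilistic_interleave(ranking_a, ranking_b, tau=3.0):
    # Softmax over rank positions: p(doc) proportional to 1 / rank^tau, so
    # every remaining document keeps a non-zero chance of being selected
    def pick(ranking, chosen):
        candidates = [d for d in ranking if d not in chosen]
        weights = [(ranking.index(d) + 1) ** -tau for d in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    interleaved, chosen = [], set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(chosen) < len(all_docs):
        ranking = random.choice([ranking_a, ranking_b])
        if all(d in chosen for d in ranking):
            ranking = ranking_b if ranking is ranking_a else ranking_a
        doc = pick(ranking, chosen)
        chosen.add(doc)
        interleaved.append(doc)
    return interleaved

print(probabilistic_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a']))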
  • 49. London Information Retrieval Meetup [Online] Probabilistic Interleaving
There are different types of interleaving:
► Balanced Interleaving: alternate insertion with one model having the priority.
► Team-Draft Interleaving: the method of team captains in team matches.
► Probabilistic Interleaving: relies on probability distributions. Every document has a non-zero probability of being added to the interleaved result list.
DRAWBACK
The use of probability distributions can lead to a worse user experience: less relevant documents can be placed higher.
  • 50. London Information Retrieval Meetup Conclusions
► Both offline and online Learning to Rank evaluations are vital for a business.
► Offline evaluation:
- doesn't affect production
- allows research and experimentation with wild ideas
- reduces the number of online experiments to run
► Online evaluation:
- measures improvements/regressions with real users
- isolates the benefits coming from the Learning to Rank models