1. The document discusses evaluating learning to rank models, including offline and online evaluation methods. Offline evaluation involves building a test set from labeled data and evaluating metrics like NDCG, while online evaluation uses methods like A/B testing and interleaving to directly measure user behavior and business metrics.
2. Common mistakes in offline evaluation include having only one sample per query, single relevance labels per query, and unrepresentative test samples. While offline evaluation provides efficiency, online evaluation allows observing real user interactions and model impact on key metrics.
3. Recommendations are given to test models both offline and online, with online testing providing advantages like measuring actual business outcomes and interpreting model effects.
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation
1. Evaluating Your Learning to Rank Model: Dos and Don'ts in Offline/Online Evaluation
London Information Retrieval Meetup
Alessandro Benedetti, Director
Anna Ruggero, R&D Software Engineer
23rd June 2020
2. Who We Are
Alessandro Benedetti
● Born in Tarquinia (an ancient Etruscan city)
● R&D Software Engineer
● Search Consultant
● Director
● Master's Degree in Computer Science
● Apache Lucene/Solr Committer
● Passionate about semantic, NLP and machine learning technologies
● Beach volleyball player and snowboarder
3. Who We Are
Anna Ruggero
● R&D Search Software Engineer
● Master's Degree in Computer Science Engineering
● Big Data, Information Retrieval
● Organist, music lover
4. Sease: Search Services
● Headquartered in London / distributed
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
www.sease.io
6. Overview
● Offline Testing for Business
● Build a Test Set
● Online Testing for Business
● A/B Testing
● Interleaving
8. [Offline] A Business Perspective
Advantages:
● Find anomalies in the data, such as weird feature distributions or strange collected values.
● Check how the model performs before using it in production: implement improvements, fix bugs, tune parameters.
● Save time and money: putting a bad model into production can worsen the user experience on the website.
10. [Offline] XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. It is Open Source.
https://github.com/dmlc/xgboost
11. [Offline] Build a Test Set

RelevanceLabel  QueryId  DocumentId  Feature1  Feature2
3               1        1           3.0       2.0
2               1        2           0.0       1.0
4               2        2           3.0       2.5
1               2        1           9.0       4.0
0               3        2           8.0       4.0
2               3        1           3.0       1.0

Create a training set with XGBoost, keeping only the feature columns:

training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]

training_data_set:

Feature1  Feature2
3.0       2.0
0.0       1.0
3.0       2.5
9.0       4.0
8.0       4.0
3.0       1.0
14. [Offline] Build a Test Set

Create a training set with XGBoost:

# keep only the feature columns
training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]

# one group size per query (rows must be contiguous per query id)
training_query_id_column = training_set_data_frame[features.QUERY_ID]
training_query_groups = training_query_id_column.value_counts(sort=False)

training_label_column = training_set_data_frame[features.RELEVANCE_LABEL]

training_xgb_matrix = xgb.DMatrix(training_data_set, label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)
15. [Offline] Build a Test Set

Create a test set with XGBoost:

test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])]

test_query_id_column = test_set_data_frame[features.QUERY_ID]
test_query_groups = test_query_id_column.value_counts(sort=False)

test_label_column = test_set_data_frame[features.RELEVANCE_LABEL]

test_xgb_matrix = xgb.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
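The value_counts(sort=False) trick above assumes the rows are already sorted by query id, because XGBoost's set_group expects one size per contiguous block of rows. A minimal dependency-free sketch of the equivalent logic (the helper name query_group_sizes is ours, not from the slides):

```python
def query_group_sizes(query_ids):
    """Group sizes for contiguous runs of the same query id, in row
    order, as expected by XGBoost's DMatrix.set_group."""
    groups = []
    previous = object()  # sentinel that matches no real query id
    for qid in query_ids:
        if qid == previous:
            groups[-1] += 1   # same query: grow the current group
        else:
            groups.append(1)  # new query: start a new group
            previous = qid
    return groups

print(query_group_sizes([1, 1, 2, 2, 3, 3]))  # [2, 2, 2]
```

If the rows are not grouped by query id, sort them first; otherwise the group boundaries (and the ranking objective) are silently wrong.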
16. [Offline] Train/Test

Train and test the model with XGBoost:

params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@4', 'verbosity': 2}
watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]

print('- - - - Training The Model')
# early_stopping_rounds is a keyword argument of xgb.train, not a params entry
xgb_model = xgb.train(params, training_xgb_matrix, num_boost_round=999,
                      evals=watch_list, early_stopping_rounds=10)

print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                     dump_format='json')
17. [Offline] Save/Load Models

Save an XGBoost model:

logging.info('- - - - Saving XGBoost model')
xgboost_model_name = output_dir + "/xgboost-" + name
xgb_model.save_model(xgboost_model_name)

Load an XGBoost model:

logging.info('- - - - Loading XGBoost model')
xgb_model = xgb.Booster()
xgb_model.load_model(model_path)
18. [Offline] Metrics

● precision = ratio of relevant results among the search results returned
● precision@K = ratio of relevant results among the top-K search results returned
● recall = ratio of relevant results found among all the relevant results
● recall@K = ratio of all the relevant results found in the top-K

What happens if:

● precision@K goes up: more relevant results in the top K
● precision@K goes down: fewer relevant results in the top K
● recall@K goes up: more relevant results found among all relevant
● recall@K goes down: fewer relevant results found among all relevant
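The definitions above fit in a few lines of plain Python (the names precision_at_k/recall_at_k are ours, for illustration):

```python
def precision_at_k(relevant, ranked, k):
    """Ratio of relevant results among the top-k returned results."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(relevant, ranked, k):
    """Ratio of the relevant results that appear in the top-k."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

relevant = {"a", "c", "e"}
ranked = ["a", "b", "c", "d"]
print(precision_at_k(relevant, ranked, 4))  # 0.5 (2 relevant out of 4 returned)
print(recall_at_k(relevant, ranked, 4))     # ~0.67 (2 of the 3 relevant found)
```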
19. [Offline] NDCG

● DCG@K = Discounted Cumulative Gain at K
● NDCG@K = Normalised Discounted Cumulative Gain at K = DCG@K / Ideal DCG@K

Relevance labels per result position:

Position  Model1  Model2  Model3  Ideal
1         1       2       2       4
2         2       3       4       3
3         3       2       3       2
4         4       4       2       2
5         2       1       1       1
6         0       0       0       0
7         0       0       0       0
NDCG      0.64    0.73    0.79    1.0

NDCG@K goes up: more relevant results in better positions, with better relevance (*)
NDCG@K goes down: fewer relevant results in worse positions, with worse relevance (*)
(*) relevance acts as a weight, discounted by result position
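A sketch of the metric, using the common exponential-gain formulation (gain 2^rel − 1, discount log2 of the position); the slide's exact numbers may come from a slightly different gain or discount, so treat this as illustrative:

```python
import math

def dcg_at_k(relevance_labels, k):
    # exponential gain (2^rel - 1), discounted by log2 of the position
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevance_labels[:k]))

def ndcg_at_k(relevance_labels, k):
    ideal = dcg_at_k(sorted(relevance_labels, reverse=True), k)
    return dcg_at_k(relevance_labels, k) / ideal if ideal > 0 else 1.0

# relevance labels by result position, as in the table above
model1 = [1, 2, 3, 4, 2, 0, 0]
model3 = [2, 4, 3, 2, 1, 0, 0]
ideal = [4, 3, 2, 2, 1, 0, 0]
print(ndcg_at_k(ideal, 7))                          # 1.0: ideal by definition
print(ndcg_at_k(model1, 7) < ndcg_at_k(model3, 7))  # True: Model3 ranks better
```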
22. [Offline] Test a Trained Model

test_data_per_queryId = [data_frame.loc[:, data_frame.columns != features.QUERY_ID]
                         for query_id, data_frame in
                         test_set_data_frame.groupby(features.QUERY_ID)]

Before grouping:

QueryId  Feature1  Feature2
1        3.0       2.0
1        0.0       1.0
2        3.0       2.0
2        9.0       4.0
3        8.0       4.0
3        3.0       1.0

test_data_per_queryId, grouped by QueryId:

QueryId  Feature1  Feature2
1        [3, 0]    [2, 1]
2        [3, 9]    [2, 4]
3        [8, 3]    [4, 1]

Each resulting data_frame keeps only the feature columns:

Feature1  Feature2
[3, 0]    [2, 1]
[3, 9]    [2, 4]
[8, 3]    [4, 1]
23. [Offline] Test a Trained Model

Test an already trained XGBoost model. Prepare the test set:

test_relevance_labels_per_queryId = [
    np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID])
    for query_id, data_frame in
    test_set_data_frame[[features.RELEVANCE_LABEL,
                         features.QUERY_ID]].groupby(features.QUERY_ID)]

test_relevance_labels_per_queryId = [
    test_relevance_labels.reshape(len(test_relevance_labels),)
    for test_relevance_labels in test_relevance_labels_per_queryId]

test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference(
    [features.RELEVANCE_LABEL, features.DOCUMENT_ID])]

test_data_per_queryId = [data_frame.loc[:, data_frame.columns != features.QUERY_ID]
                         for query_id, data_frame in
                         test_set_data_frame.groupby(features.QUERY_ID)]

test_xgb_matrix_list = [xgb.DMatrix(test_set) for test_set in test_data_per_queryId]
24. [Offline] Test a Trained Model

Test an already trained XGBoost model:

predictions_with_relevance = []
logging.info('- - - - Making predictions')
predictions_list = [xgb_model.predict(test_xgb_matrix)
                    for test_xgb_matrix in test_xgb_matrix_list]

# test_label_list holds the per-query relevance labels built on the previous slide
for predictions, labels in zip(predictions_list, test_label_list):
    to_data_frame = [list(row) for row in zip(predictions, labels)]
    predictions_with_relevance.append(pd.DataFrame(
        to_data_frame, columns=['predicted_score', 'relevance_label']))

predictions_with_relevance = [
    predictions_per_query.sort_values(by='predicted_score', ascending=False)
    for predictions_per_query in predictions_with_relevance]

logging.info('- - - - NDCG computation')
ndcg_scores_list = []
for predictions_per_query in predictions_with_relevance:
    ndcg = ndcg_at_k(predictions_per_query['relevance_label'],
                     len(predictions_per_query))
    ndcg_scores_list.append(ndcg)

final_ndcg = statistics.mean(ndcg_scores_list)
logging.info('- - - - The final NDCG is: ' + str(final_ndcg))
25. [Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

● One sample per query group:
If we have a small number of interactions, it can happen during the split that we obtain some queries with just a single sample. In this case the NDCG@K for the query group will be 1, independently of the model: the only document is already in its ideal position, so DCG@K equals the Ideal DCG@K.

Query1 (single sample):
          Model1  Model2  Model3  Ideal
relevance 1       1       1       1
DCG       1       1       1       1

Query2 (single sample):
          Model1  Model2  Model3  Ideal
relevance 3       3       3       3
DCG       7       7       7       7
26. [Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

● One sample per query group
● One relevance label for all the samples in a query group:
During the split we could end up with a query group whose test samples all share a single relevance label. Any ordering of such a group is ideal, so its NDCG@K is again 1 regardless of the model.
27. [Offline] Common Mistakes

Let's see the common mistakes to avoid during the test set creation:

● One sample per query group
● One relevance label for all the samples of a query group
● Samples considered for the data set creation:
We have to be sure that we are using a realistic set of samples for the test set creation. These <query, document> pairs represent the possible user behaviour, so they must have a balance of unknown/known queries, with results mixed in relevance.
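The first two mistakes can be caught automatically before evaluation. A hedged sketch (the helper degenerate_query_groups and the (query_id, relevance_label) input shape are ours, not from the slides):

```python
from collections import defaultdict

def degenerate_query_groups(samples):
    """Return query ids whose NDCG@K is trivially 1: groups with a
    single sample, or whose samples all share one relevance label."""
    labels_per_query = defaultdict(list)
    for query_id, relevance_label in samples:
        labels_per_query[query_id].append(relevance_label)
    return {query_id for query_id, labels in labels_per_query.items()
            if len(labels) == 1 or len(set(labels)) == 1}

samples = [(1, 3), (1, 2),   # query 1: fine, two distinct labels
           (2, 4),           # query 2: single sample
           (3, 0), (3, 0)]   # query 3: one label for all samples
print(degenerate_query_groups(samples))  # {2, 3}
```

Dropping (or re-splitting) the flagged groups keeps them from inflating the mean NDCG.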
29. [Online] A Business Perspective

There are several problems that are hard to detect with an offline evaluation:

► An incorrect or imperfect test set yields model evaluation results that do not reflect the real model improvements/regressions. We may get an extremely high evaluation metric offline, but only because we designed the test improperly; the model is unfortunately not a good fit.
► Finding a direct correlation between the offline evaluation metrics and the parameters used for the online model performance evaluation (e.g. revenues, click-through rate…) is hard.
► Offline evaluation is based on generated relevance labels that do not always reflect the real user need.
32. [Online] Business Advantages

Using online testing brings many advantages:

► Reliability of the results: we directly observe user behaviour.
► Interpretability of the results: we directly observe the impact of the model in terms of the online metrics the business cares about.
► Observability of the model behaviour: we can see how users interact with the model and figure out how to improve it.
33. [Online] Signals to Measure

● Click-Through Rate (views, downloads, add to cart…)
● Sale/Revenue Rate
● Dwell time (time spent on a search result after the click)
● Query reformulations / bounce rate
● …

When training the model, we probably chose one objective to optimise (there are also multi-objective learning to rank models).

Recommendation: test for direct correlation with that objective!
36. [Online] A/B Testing Noise 1

Extra care is needed when implementing A/B Testing.

► Be sure to consider only interactions from result pages ranked by the models you are comparing, i.e. do not count every click, sale or download happening on the site.
► Suppose we are analyzing model A. We obtain: 10 sales from the homepage and 5 sales from the search page.
► Suppose we are analyzing model B. We obtain: 4 sales from the homepage and 10 sales from the search page.

Is Model A better than Model B? Counting all sales, A wins (15 vs 14); counting only the search-page sales the models actually ranked, B wins (10 vs 5).
39. [Online] A/B Testing Noise 2

Extra care is needed when implementing A/B Testing.

► Be sure to consider only interactions from result pages ranked by the models you are comparing.
► Suppose we are analyzing model A. We obtain: 12 sales from the homepage and 11 sales from the search page.
► Suppose we are analyzing model B. We obtain: 5 sales from the homepage and 10 sales from the search page.

Is Model A better than Model B? Counting all sales, A looks much better (23 vs 15); counting only the search-page sales, the difference is marginal (11 vs 10).
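This kind of noise disappears once interactions are filtered by source before comparing the models. A toy sketch (the (model, source) record shape is hypothetical, for illustration only):

```python
def sales_by_model(interactions, source="search"):
    """Count sales per model, keeping only interactions coming from
    the result pages actually ranked by the models under test."""
    counts = {}
    for model, src in interactions:
        if src == source:
            counts[model] = counts.get(model, 0) + 1
    return counts

# Noise 1 numbers: homepage sales are noise for the ranking comparison
interactions = ([("A", "homepage")] * 10 + [("A", "search")] * 5 +
                [("B", "homepage")] * 4 + [("B", "search")] * 10)
print(sales_by_model(interactions))  # {'A': 5, 'B': 10}: B actually wins
```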
42. [Online] Interleaving Advantages

► It reduces the variance problem caused by splitting users into separate groups (group A and group B).
► It is more sensitive when comparing models.
► It requires less traffic.
► It requires less time to achieve reliable results.
► It doesn't necessarily expose a bad model to a sub-population of users.
44. [Online] Balanced Interleaving

There are different types of interleaving:

► Balanced Interleaving: alternate insertion, with one model having the priority.

DRAWBACK

► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► The comparison phase will bring Model B to win more often than Model A. This happens regardless of the model chosen as prior.
► This drawback arises due to:
► the way in which the evaluation of the results is done;
► the fact that Model B ranks all documents except a higher than Model A does.
46. [Online] Team-Draft Interleaving

There are different types of interleaving:

► Balanced Interleaving: alternate insertion, with one model having the priority.
► Team-Draft Interleaving: the method used by team captains in team matches.
https://issues.apache.org/jira/browse/SOLR-14560

DRAWBACK

► When comparing two very similar models:
► Model A: lA = (a, b, c, d)
► Model B: lB = (b, c, d, a)
► Suppose c is the only relevant document.
► With this approach we can obtain four different interleaved lists:
► lI1 = (aA, bB, cA, dB)
► lI2 = (bB, aA, cB, dA)
► lI3 = (bB, aA, cA, dB)
► lI4 = (aA, bB, cB, dA)
► All of them put c at the same rank, so clicks on c credit each model equally often: a tie! But Model B, which ranks c higher, should be chosen as the best model.
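The team-captain idea can be sketched in a few lines (this is our own minimal sketch, not the SOLR-14560 implementation): the model with fewer picks so far, with a coin toss on ties, appends its highest-ranked document not yet chosen and is credited for clicks on it.

```python
import random

def team_draft_interleave(list_a, list_b, rng=None):
    """Sketch of Team-Draft Interleaving. Returns (document, team)
    pairs; the team marks which model gets credit for a click."""
    rng = rng or random.Random()
    interleaved, chosen = [], set()
    team_a = team_b = 0
    all_docs = set(list_a) | set(list_b)
    while len(chosen) < len(all_docs):
        # each captain's highest-ranked document not yet chosen
        next_a = next((d for d in list_a if d not in chosen), None)
        next_b = next((d for d in list_b if d not in chosen), None)
        # captain with fewer picks goes first; coin toss on ties
        a_picks = team_a < team_b or (team_a == team_b and rng.random() < 0.5)
        if next_b is None or (a_picks and next_a is not None):
            interleaved.append((next_a, 'A')); chosen.add(next_a); team_a += 1
        else:
            interleaved.append((next_b, 'B')); chosen.add(next_b); team_b += 1
    return interleaved

# The slide's example: whatever the coin tosses, document c always lands
# at position 3, so clicks on c produce a tie between the two models.
result = team_draft_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a'],
                               rng=random.Random(42))
print(result)
```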
48. [Online] Probabilistic Interleaving

There are different types of interleaving:

► Balanced Interleaving: alternate insertion, with one model having the priority.
► Team-Draft Interleaving: the method used by team captains in team matches.
► Probabilistic Interleaving: relies on probability distributions. Every document has a non-zero probability of being added to the interleaved result list.

DRAWBACK

The use of probability distributions could lead to a worse user experience: less relevant documents could be put higher.
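A rough sketch of the idea, assuming a softmax-like distribution over ranks (the 1/rank^tau weighting with tau = 3 follows a common formulation; real implementations differ in details, so treat this as illustrative):

```python
import random

def probabilistic_interleave(list_a, list_b, tau=3.0, rng=None):
    """Sketch of Probabilistic Interleaving: at each position a model
    is chosen by coin toss and samples one of its not-yet-chosen
    documents with probability proportional to 1/rank^tau, so every
    document has a non-zero chance of reaching any position."""
    rng = rng or random.Random()

    def sample(ranking, chosen):
        candidates = [d for d in ranking if d not in chosen]
        weights = [1.0 / (ranking.index(d) + 1) ** tau for d in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    interleaved, chosen = [], set()
    all_docs = set(list_a) | set(list_b)
    while len(chosen) < len(all_docs):
        ranking = list_a if rng.random() < 0.5 else list_b
        if all(d in chosen for d in ranking):  # this model is exhausted
            ranking = list_b if ranking is list_a else list_a
        doc = sample(ranking, chosen)
        interleaved.append(doc)
        chosen.add(doc)
    return interleaved

result = probabilistic_interleave(['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'a'],
                                  rng=random.Random(0))
print(sorted(result))  # ['a', 'b', 'c', 'd']: a permutation of the union
```

The randomness is exactly where the drawback comes from: a low-ranked (less relevant) document occasionally wins the draw and is shown high in the list.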
50. Conclusions

► Both Offline and Online Learning To Rank evaluations are vital for a business.

► Offline:
- doesn't affect production
- allows research and experimentation with wild ideas
- reduces the number of online experiments to run

► Online:
- measures improvements/regressions with real users
- isolates the benefits coming from the Learning To Rank models