IRGIRGroup @UAM
On Target Item Sampling in Offline Recommender System Evaluation
14th ACM Conference on Recommender Systems (RecSys 2020)
Virtual Event, Brazil, September 2020
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es
Offline evaluation
Is system A better than B?
Offline evaluation
[Diagram: the set of all items is split into training, test, and unrated items. Systems A and B are built from the training data, rank items, and metrics computed on the test data decide whether A > B.]
Getting it right
• Correlate with production setting / online evaluation
• Consistent & comparable with other offline experiments
Offline evaluation – Target items
• Exclude items with training data
• Include a certain number of non-relevant items (e.g. to reduce cost)
• Can this change the outcome?
[Diagram: systems A and B rank only a set of target items drawn from the set of all items (training, unrated, test); metrics computed on the test data decide whether A > B.]
Target items
[Diagram: items are partitioned into training, test (liked / not liked), and unrated; the target items are drawn from these.]
Target items
• Test + all unrated: all items except training items (largest target set)
• Test + no unrated: just test ratings (smallest target set)
Target items
• Test + all unrated
• Test + some unrated
• Test + no unrated
May the number of unrated target items affect the evaluation outcome?
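The three target-set options can be sketched as a single function of the number of unrated items to include. This is a minimal illustration, not the authors' code: the function name `build_target_items`, its parameters, and uniform sampling of unrated items are all assumptions of this sketch.

```python
import random

def build_target_items(all_items, training_items, test_items, n_unrated, seed=0):
    """Target set for one user: test items plus n_unrated sampled unrated items.

    n_unrated=None means "test + all unrated"; n_unrated=0 means "test + no unrated".
    Uniform sampling is an assumption of this sketch.
    """
    # Unrated items are those with neither a training nor a test rating.
    unrated = sorted(set(all_items) - set(training_items) - set(test_items))
    if n_unrated is None or n_unrated >= len(unrated):
        sampled = unrated
    else:
        sampled = random.Random(seed).sample(unrated, n_unrated)
    # Training items are always excluded from the targets.
    return set(test_items) | set(sampled)

# Hypothetical toy catalogue of 10 items.
items, train, test = range(10), {0, 1}, {2, 3}
smallest = build_target_items(items, train, test, 0)      # just test ratings
largest = build_target_items(items, train, test, None)    # all items except training
between = build_target_items(items, train, test, 3)       # test + some unrated
```

Sweeping `n_unrated` from 0 to `None` covers the range between the smallest and the largest target set.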
Result inconsistency
[Figure: toy example with two rankings, A and B, of five items (test items liked or not liked, plus unrated items), evaluated with P@2. With all unrated items as targets, A > B; with no unrated items, A < B. P@2 values shown: 0, 0.5, 1, 0.5.]
May this affect the evaluation outcome?
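A small worked example makes the flip concrete. The item names and the two rankings below are hypothetical, chosen only to reproduce the kind of reversal the slide describes; they are not taken from the slide's figure.

```python
def precision_at_k(ranking, liked, k):
    """Fraction of the top-k ranked items that are liked."""
    return sum(item in liked for item in ranking[:k]) / k

def restrict_to_targets(ranking, targets):
    """Keep only target items, preserving the system's order."""
    return [item for item in ranking if item in targets]

# Hypothetical data: l* = liked test items, n1 = not-liked test item, u* = unrated.
liked = {"l1", "l2"}
test_items = {"l1", "l2", "n1"}
ranking_a = ["l1", "u1", "n1", "l2", "u2"]
ranking_b = ["u1", "u2", "l1", "l2", "n1"]
all_targets = test_items | {"u1", "u2"}

for name, targets in [("all unrated", all_targets), ("no unrated", test_items)]:
    p_a = precision_at_k(restrict_to_targets(ranking_a, targets), liked, 2)
    p_b = precision_at_k(restrict_to_targets(ranking_b, targets), liked, 2)
    print(name, p_a, p_b)
# all unrated 0.5 0.0   (A > B)
# no unrated 0.5 1.0    (A < B)
```

Removing unrated items can only move test items up in each ranking, so the system that buried its liked items under unrated ones (B here) benefits most.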
Result inconsistency
A simple offline experiment on MovieLens 1M
8 systems:
• iMF (full)
• iMF (test)
• kNN (full/test)
• Normalized kNN (full)
• Normalized kNN (test)
• Average rating
• Popularity
• Random
[Figure: P@10 of each system under the two target sets, Full (all unrated) and Test (no unrated); y-axis from 0 to 0.6.]
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: the 8 systems (iMF (full), iMF (test), kNN (full/test), Normalized kNN (full), Normalized kNN (test), Average rating, Most popular, Random) ranked by P@10 from best (1) to worst (8) under each target set, All unrated vs. No unrated.]
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: the 8 systems ranked from best (1) to worst (8) under each target set, All unrated vs. No unrated.]
Kendall τ = 0.14 between the two system rankings
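For readers unfamiliar with the statistic, Kendall τ between two system rankings can be computed by counting concordant and discordant system pairs. The sketch below implements the tie-free variant (tau-a) in plain Python; the system names `s1` to `s4` are placeholders, and the slides do not say which τ variant the authors used.

```python
from itertools import combinations

def kendall_tau(ranking_x, ranking_y):
    """Kendall tau-a between two orderings of the same systems (no ties assumed)."""
    pos_x = {s: i for i, s in enumerate(ranking_x)}
    pos_y = {s: i for i, s in enumerate(ranking_y)}
    concordant = discordant = 0
    for s, t in combinations(ranking_x, 2):
        # Positive product: both rankings order the pair the same way.
        agreement = (pos_x[s] - pos_x[t]) * (pos_y[s] - pos_y[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(ranking_x) * (len(ranking_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of four systems, best first.
full = ["s1", "s2", "s3", "s4"]
swapped = ["s2", "s1", "s3", "s4"]          # one discordant pair out of six
tau_same = kendall_tau(full, full)          # 1.0
tau_swap = kendall_tau(full, swapped)       # (5 - 1) / 6, about 0.67
tau_rev = kendall_tau(full, full[::-1])     # -1.0
```

A value near 0, such as the 0.14 reported on the slide, means the two target sets order the systems almost independently of each other.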
Biased disagreement
Systematic disagreements by number of unrated target items:
• Many unrated target items favor: popular items; unrated items in objective function
• Few unrated target items favor: items with high average rating; ignoring unrated items in objective
Biased disagreement
Which one is right?
Biased disagreement – Is either one right?
Since we want to match online evaluation, let’s compare to unbiased evaluation
[Diagram: evaluation with few unrated items and with many unrated items, each compared against unbiased evaluation. Which one is right?]
Yahoo! R3
• MAR (missing at random) ratings → Unbiased evaluation
• MNAR (missing not at random) ratings → Biased evaluation
Comparison to unbiased evaluation
Biased vs. unbiased evaluation with Yahoo! R3
[Figure: system rankings (positions 1 to 8) under no unrated and all unrated targets, each compared with the unbiased system ranking; Kendall τ values shown: 0.79 and 0.57.]
Neither all nor zero unrated targets match unbiased evaluation well
How about something in between?
Let’s explore the target size range…
Comparison to unbiased evaluation – Correlation
[Figure, Yahoo! R3: Kendall τ against the unbiased system ranking (y-axis, 0 to 1) vs. # unrated target items (x-axis: 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a “sweet spot” marked. Inset: system rankings under no unrated and all unrated targets compared with the unbiased ranking; τ values shown: 0.79 and 0.57.]
Comparison to unbiased evaluation – Correlation
[Figure, Yahoo! R3: Kendall τ against unbiased evaluation (0 to 1) vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a “sweet spot” marked.]
[Figure, MovieLens 1M: same layout, # unrated target items from 0 to 2000 and All; the sweet spot is unknown (“?”).]
No MAR data
Check discriminative power
Comparison to unbiased evaluation – Discriminative power: ties
[Figure, Yahoo! R3: # ties and Kendall τ vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with the “sweet spot” marked.]
[Figure, MovieLens 1M: # ties vs. # unrated target items (0 to 2000, All). Sweet spot?]
Almost opposite monotonicity
Check discriminative power
Comparison to unbiased evaluation – Discriminative power: ties
Why do ties increase in the extremes?
• Few unrated items: small set of items to rank
• Many unrated items: metric → 0 as # unrated → ∞
[Figure, Yahoo! R3: # ties and Kendall τ vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with the “sweet spot” marked and “Many ties” at the extremes.]
[Figure, MovieLens 1M: # ties vs. # unrated target items (0 to 2000, All). Sweet spot?]
Almost opposite monotonicity
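Tie counting can be sketched as follows. The slides do not define exactly how ties are counted, so this sketch assumes one simple reading: a tie is a pair of systems whose metric values are equal within a tolerance. The function name, the tolerance, and the system scores are all hypothetical.

```python
from itertools import combinations

def count_ties(metric_by_system, tol=1e-9):
    """Number of system pairs whose metric values are indistinguishable.

    Assumption of this sketch: ties are counted over system pairs on a
    single aggregate metric value per system.
    """
    systems = sorted(metric_by_system)
    return sum(
        1
        for a, b in combinations(systems, 2)
        if abs(metric_by_system[a] - metric_by_system[b]) <= tol
    )

# Hypothetical P@10 values: with a very large target set most systems
# collapse to 0, so many pairs become tied.
large_target_set = {"s1": 0.0, "s2": 0.0, "s3": 0.0, "s4": 0.1}
balanced_target_set = {"s1": 0.05, "s2": 0.12, "s3": 0.2, "s4": 0.31}

ties_large = count_ties(large_target_set)        # 3 tied pairs
ties_balanced = count_ties(balanced_target_set)  # 0 tied pairs
```

Tracking this count across target sizes needs no unbiased (MAR) data, which is why it can be computed on datasets such as MovieLens 1M.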
Comparison to unbiased evaluation – Discriminative power: 𝑝-values
[Figure, Yahoo! R3: # ties, sum of 𝑝-values, and Kendall τ vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with the “sweet spot” marked.]
[Figure, MovieLens 1M: # ties and sum of 𝑝-values vs. # unrated target items (0 to 2000, All). Sweet spot?]
The number of ties seems more informative than 𝑝-values
Loss of coverage
Small target sets can easily cause incomplete rankings
Risk of highly misleading results depending on how the metric deals with this
[Diagram: with no unrated targets, a ranking may end before the metric cutoff (coverage loss); with all unrated targets it extends past the cutoff.]
[Figures: Coverage@10 (0 to 1) vs. # unrated target items on Yahoo! R3 (0 to 500, All) and MovieLens 1M (0 to 2000, All); kNN with small 𝑘 is highlighted in both.]
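One plausible way to measure this loss is the average fraction of the top-k slots that a system manages to fill per user. The slides do not give the exact definition of Coverage@10, so the formula, the function name `coverage_at_k`, and the toy data below are assumptions of this sketch.

```python
def coverage_at_k(rankings_by_user, k):
    """Average fraction of the top-k slots a system actually fills across users.

    Assumed definition for illustration; the slide's Coverage@10 may differ.
    """
    if not rankings_by_user:
        return 0.0
    filled = [min(len(ranking), k) / k for ranking in rankings_by_user.values()]
    return sum(filled) / len(filled)

# Hypothetical rankings: user u2 has only 5 target items the system can score.
rankings = {
    "u1": list(range(10)),  # complete top-10
    "u2": list(range(5)),   # ranking ends before the cutoff
}
coverage = coverage_at_k(rankings, 10)  # (1.0 + 0.5) / 2 = 0.75
```

How a metric treats the unfilled slots (as non-relevant, or by shrinking the cutoff) changes the reported value, which is the risk the slide points out.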
Conclusion
• Different target sets produce different evaluation outcomes
– The disagreements are systematic on specific algorithms and configurations
• Weakness of small target sets
– More difficult to produce different rankings → discrimination power loss
– Incomplete rankings
• Weakness of large target sets
– Exposure to observation bias (popularity or any other MNAR bias)
– More difficult to produce metric values > 0 → discrimination power loss
• Tie analysis can provide helpful orientation
Neither seems ideal!
Sweet spot → balance
Future work
• Target items introduce a pre-filter that may alter the evaluated algorithms
– Different target sampling distributions (e.g. popularity)
– Different split protocols (e.g. temporal) also affect this
• Further research on offline evaluation bias
– Does unbiased Yahoo! R3 match a real setting?
• Also check out:
– Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
– Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020

More Related Content

Similar to RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation

Opticon 2017 Decisions at Scale
Opticon 2017 Decisions at ScaleOpticon 2017 Decisions at Scale
Opticon 2017 Decisions at ScaleOptimizely
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsApigee | Google Cloud
 
TargetSummit Berlin - Lovoo Lele Canfora
TargetSummit Berlin -  Lovoo Lele CanforaTargetSummit Berlin -  Lovoo Lele Canfora
TargetSummit Berlin - Lovoo Lele CanforaTargetSummit
 
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...IRJET Journal
 
Artificial Intelligence in Action
Artificial Intelligence in ActionArtificial Intelligence in Action
Artificial Intelligence in ActionBenjamin Ejzenberg
 
Beyond Simple A/B testing
Beyond Simple A/B testingBeyond Simple A/B testing
Beyond Simple A/B testingRatio
 
What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3Ogilvy Consulting
 
Google Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQueryGoogle Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQueryCARTO
 
Intro to Data Analytics with Oscar's Director of Product
 Intro to Data Analytics with Oscar's Director of Product Intro to Data Analytics with Oscar's Director of Product
Intro to Data Analytics with Oscar's Director of ProductProduct School
 
Building Analytics for Growth
Building Analytics for GrowthBuilding Analytics for Growth
Building Analytics for GrowthKareem Azees
 
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...Applitools
 
Get Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With DigitalGet Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With DigitalJoshua Stauffer
 
Intelligence Data Day 2020
Intelligence Data Day 2020Intelligence Data Day 2020
Intelligence Data Day 2020Patrick Deglon
 
Digital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and BeyondDigital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and BeyondSCL HUB Conference
 
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - WebinarA/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - WebinarRoberto Turrin
 
Peak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 UpdatePeak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 UpdatePaul Drägert
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaEdureka!
 
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?Michaela Linhart
 

Similar to RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation (20)

Opticon 2017 Decisions at Scale
Opticon 2017 Decisions at ScaleOpticon 2017 Decisions at Scale
Opticon 2017 Decisions at Scale
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital Assets
 
TargetSummit Berlin - Lovoo Lele Canfora
TargetSummit Berlin -  Lovoo Lele CanforaTargetSummit Berlin -  Lovoo Lele Canfora
TargetSummit Berlin - Lovoo Lele Canfora
 
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...IRJET-  	  Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
IRJET- Sentiment Analysis of Customer Reviews on Laptop Products for Flip...
 
Series A Deck
Series A DeckSeries A Deck
Series A Deck
 
Artificial Intelligence in Action
Artificial Intelligence in ActionArtificial Intelligence in Action
Artificial Intelligence in Action
 
Beyond Simple A/B testing
Beyond Simple A/B testingBeyond Simple A/B testing
Beyond Simple A/B testing
 
What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3What's Next: Cloudy with a chance of AI 3
What's Next: Cloudy with a chance of AI 3
 
Google Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQueryGoogle Analytics location data visualised with CARTO & BigQuery
Google Analytics location data visualised with CARTO & BigQuery
 
Intro to Data Analytics with Oscar's Director of Product
 Intro to Data Analytics with Oscar's Director of Product Intro to Data Analytics with Oscar's Director of Product
Intro to Data Analytics with Oscar's Director of Product
 
Building Analytics for Growth
Building Analytics for GrowthBuilding Analytics for Growth
Building Analytics for Growth
 
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
Wrong Tool, Wrong Time: Re-Thinking Test Automation -- w/ State of Visual Tes...
 
Get Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With DigitalGet Scrappy: Start Measuring Customer LTV With Digital
Get Scrappy: Start Measuring Customer LTV With Digital
 
Intelligence Data Day 2020
Intelligence Data Day 2020Intelligence Data Day 2020
Intelligence Data Day 2020
 
Digital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and BeyondDigital Decisioning for the New Decade - 2020 and Beyond
Digital Decisioning for the New Decade - 2020 and Beyond
 
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - WebinarA/B Testing Data-Driven Algorithms in the Cloud - Webinar
A/B Testing Data-Driven Algorithms in the Cloud - Webinar
 
Peak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 UpdatePeak Ace on Air #31 - Apple's iOS 14.5 Update
Peak Ace on Air #31 - Apple's iOS 14.5 Update
 
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | EdurekaSupervised vs Unsupervised vs Reinforcement Learning | Edureka
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
 
Projects
ProjectsProjects
Projects
 
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
 

More from Pablo Castells

REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...Pablo Castells
 
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...Pablo Castells
 
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...Pablo Castells
 
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender SystemsSIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender SystemsPablo Castells
 
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...Pablo Castells
 
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...Pablo Castells
 

More from Pablo Castells (6)

REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
REVEAL @ RecSys 2018 - Characterization of Fair Experiments for Recommender S...
 
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
SIGIR 2017 - A Probabilistic Reformulation of Memory-Based Collaborative Filt...
 
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
RSWeb @ ACM RecSys 2014 - Exploring social network effects on popularity bias...
 
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender SystemsSIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
SIGIR 2011 Poster - Intent-Oriented Diversity in Recommender Systems
 
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...SIGIR 2012 - Explicit Relevance Models in Intent-Oriented  Information Retrie...
SIGIR 2012 - Explicit Relevance Models in Intent-Oriented Information Retrie...
 
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Rec...
 

Recently uploaded

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 

Recently uploaded (20)

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation

  • 1. IR Group @UAM. On Target Item Sampling in Offline Recommender System Evaluation. Rocío Cañamares and Pablo Castells, Universidad Autónoma de Madrid, http://ir.ii.uam.es. 14th ACM Conference on Recommender Systems (RecSys 2020), Virtual Event, Brazil, September 2020.
  • 2. Offline evaluation. Is system A better than B?
  • 3. Offline evaluation. Getting it right: • Correlate with production setting / online evaluation • Consistent & comparable with other offline experiments. [Diagram: systems A and B, built from the training data, rank the set of all items (training, test, unrated); metrics are computed on the test data to decide whether A > B.]
  • 4. Offline evaluation: target items. • Exclude items with training data • Include a certain number of non-relevant items (e.g. to reduce cost) • Can this change the outcome? [Diagram: the target items are selected from the set of all items (training, unrated, test) before A and B are compared with metrics computed on the test data.]
  • 5. Target items. [Diagram: items are split into training, test (liked / not liked) and unrated; the target items are marked within this split.]
  • 6. Target items. Largest target set: test + all unrated (all items except training items). Smallest target set: test + no unrated (just test ratings). [Diagram legend: test (liked / not liked), unrated.]
  • 7. Target items. Three options: test + all unrated, test + some unrated, test + no unrated. May the number of unrated target items affect the evaluation outcome?
  • 8. Result inconsistency. [Diagram: rankings A and B over five items (test liked / not liked, unrated).] With all unrated items as targets the two rankings score P@2 = 0.5 and P@2 = 0; with no unrated targets they score P@2 = 0.5 and P@2 = 1, so the comparison flips (> vs. <). May this affect the evaluation outcome?
  • 9. Result inconsistency. A simple offline experiment on MovieLens 1M. [Figure: P@10 (y-axis, 0 to 0.6) of 8 systems under two target sets, all unrated (Full) and no unrated (Test). Systems: iMF (full), iMF (test), kNN (full/test), Normalized kNN (full), Normalized kNN (test), Average rating, Popularity, Random.]
  • 10. Result inconsistency. A simple offline experiment on MovieLens 1M. [Figure: the 8 systems ordered by P@10 from best (1) to worst (8) under each target set, all unrated vs. no unrated. Systems: iMF (full), iMF (test), kNN (full/test), Normalized kNN (full), Normalized kNN (test), Average rating, Most popular, Random.]
  • 11. Result inconsistency. A simple offline experiment on MovieLens 1M. Kendall τ = 0.14 between the system ranking with no unrated targets and the system ranking with all unrated targets. [Figure: the 8 systems ordered from best (1) to worst (8) under each target set.]
  • 12. Biased disagreement. Systematic disagreements by number of unrated target items. Many unrated target items: • Popular items • Unrated items in objective function. Few unrated target items: • Items with high average rating • Ignoring unrated items in objective.
  • 13. Biased disagreement. Which one is right? Many unrated target items: • Popular items • Unrated items in objective function. Few unrated target items: • Items with high average rating • Ignoring unrated items in objective.
  • 14. Biased disagreement: is either one right? Which one is right: few unrated items or many unrated items? Since we want to match online evaluation, let's compare to unbiased evaluation. Yahoo! R3: MAR ratings → unbiased evaluation; MNAR ratings → biased evaluation.
  • 15. Comparison to unbiased evaluation. Biased vs. unbiased evaluation with Yahoo! R3. [Figure: system rankings (1 to 8) under no unrated, unbiased, and all unrated evaluation, annotated with τ = 0.79 and τ = 0.57.] Neither all nor zero unrated targets match unbiased evaluation well. How about something in between? Let's explore the target size range…
  • 16. Comparison to unbiased evaluation: correlation. [Figure, Yahoo! R3: Kendall τ (y-axis, 0 to 1) vs. # unrated target items (x-axis: 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a "sweet spot" marked. Inset: system rankings under no unrated, unbiased, and all unrated evaluation, annotated with τ = 0.79 and τ = 0.57.]
  • 17. Comparison to unbiased evaluation: correlation. [Figure, Yahoo! R3: Kendall τ (0 to 1) vs. # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a "sweet spot" marked.] [Panel, MovieLens 1M: # unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, All); sweet spot unknown (?).] No MAR data for MovieLens 1M. Check discriminative power.
  • 18. Comparison to unbiased evaluation: discriminative power (ties). Check discriminative power. [Figure, Yahoo! R3: Kendall τ and # ties vs. # unrated target items (0 to All), with the "sweet spot" marked.] Almost opposite monotonicity. [Figure, MovieLens 1M: # ties vs. # unrated target items (0 to All).] Sweet spot?
  • 19. Comparison to unbiased evaluation: discriminative power (ties). Why do ties increase in the extremes? • Few unrated items: small set of items to rank • Many unrated items: metric → 0 as # unrated → ∞. [Figure, Yahoo! R3: Kendall τ and # ties vs. # unrated target items (0 to All), with the "sweet spot" marked; almost opposite monotonicity; many ties at the extremes.] [Figure, MovieLens 1M: # ties vs. # unrated target items (0 to All).] Sweet spot?
  • 20. Comparison to unbiased evaluation: discriminative power (p-values). [Figure, Yahoo! R3: Kendall τ, # ties and sum of p-values vs. # unrated target items (0 to All), with the "sweet spot" marked.] [Figure, MovieLens 1M: # ties and sum of p-values vs. # unrated target items (0 to All).] Sweet spot? The number of ties seems more informative than p-values.
  • 21. Loss of coverage. Small target sets can easily cause incomplete rankings. Risk of highly misleading results depending on how the metric deals with this. [Diagram: with no unrated targets a ranking can end before the metric cutoff (coverage loss); with all unrated targets it does not.] [Figures, Yahoo! R3 and MovieLens 1M: Coverage@10 (y-axis, 0 to 1) vs. # unrated target items (0 to All); kNN with small k is highlighted.]
  • 22. Conclusion
      • Different target sets produce different evaluation outcomes
          - The disagreements are systematic on specific algorithms and configurations
      • Weakness of small target sets
          - More difficult to produce different rankings → discrimination power loss
          - Incomplete rankings
      • Weakness of large target sets
          - Exposure to observation bias (popularity or any other MNAR bias)
          - More difficult to produce metric values > 0 → discrimination power loss
      • Neither seems ideal! Sweet spot → balance
      • Tie analysis can provide helpful orientation
  • 23. Future work
      • Target items introduce a pre-filter that may alter the evaluated algorithms
          - Different target sampling distributions (e.g. popularity)
          - Different split protocols (e.g. temporal) also affect this
      • Further research on offline evaluation bias
          - Does unbiased Yahoo! R3 match a real setting?
      • Also check out:
          - Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
          - Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020
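
The target-set options from the slides (test + all, some, or no unrated items) and the kind of flip shown under "Result inconsistency" can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the item identifiers, the two rankings, and the uniform sampling of unrated items are all assumptions made for the example, and the Kendall correlation is the simple tau-a variant, since the slides do not say which variant was used.

```python
import random
from itertools import combinations


def target_items(test_items, unrated_items, n_unrated=None, seed=0):
    """Target set = all test items plus n_unrated unrated items.

    n_unrated=None keeps every unrated item ("all unrated");
    n_unrated=0 keeps test items only ("no unrated").
    Uniform sampling is an assumption made for this sketch.
    """
    unrated = sorted(unrated_items)
    if n_unrated is None or n_unrated >= len(unrated):
        sampled = unrated
    else:
        sampled = random.Random(seed).sample(unrated, n_unrated)
    return set(test_items) | set(sampled)


def precision_at_k(ranking, targets, liked, k):
    """P@k after filtering the ranking down to the target items.

    A filtered ranking shorter than k is still divided by k, which is
    only one of several ways a metric can treat incomplete rankings.
    """
    filtered = [item for item in ranking if item in targets]
    hits = sum(1 for item in filtered[:k] if item in liked)
    return hits / k


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a between two score lists over the same systems (needs >= 2)."""
    pairs = list(combinations(range(len(scores_x)), 2))
    total = 0
    for i, j in pairs:
        prod = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / len(pairs)


# Hypothetical toy data: two liked test items, one disliked, two unrated.
liked = {"l1", "l2"}
test = {"l1", "l2", "n1"}
unrated = {"u1", "u2"}
ranking_a = ["l1", "n1", "u1", "l2", "u2"]
ranking_b = ["u1", "u2", "l1", "l2", "n1"]

full = target_items(test, unrated)                     # all unrated
test_only = target_items(test, unrated, n_unrated=0)   # no unrated

p_a_full = precision_at_k(ranking_a, full, liked, 2)
p_b_full = precision_at_k(ranking_b, full, liked, 2)
p_a_test = precision_at_k(ranking_a, test_only, liked, 2)
p_b_test = precision_at_k(ranking_b, test_only, liked, 2)

# Agreement between the two system orderings (A, B) across target sets.
tau = kendall_tau([p_a_full, p_b_full], [p_a_test, p_b_test])
print(p_a_full, p_b_full, p_a_test, p_b_test, tau)
```

In this toy setup A beats B when all unrated items are targets and loses when none are, so the two target sets disagree completely on the ordering of the systems. Passing an intermediate `n_unrated` gives the "test + some unrated" case that the slides explore when looking for a sweet spot.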