RecSys 2020 - On Target Item Sampling in Offline Recommender System Evaluation

IR Group @UAM
On Target Item Sampling in Offline Recommender System Evaluation
14th ACM Conference on Recommender Systems (RecSys 2020)
Virtual Event, Brazil, September 2020
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es

Offline evaluation
Is system A better than B?

Offline evaluation
[Figure: the set of all items is split into training, test, and unrated items; systems A and B use the training data, rank the items, and metrics computed on the test data decide whether A > B.]
Getting it right
• Correlate with production setting / online evaluation
• Consistent & comparable with other offline experiments

Offline evaluation – Target items
• Exclude items with training data
• Include a certain number of non-relevant items (e.g. to reduce cost)
• Can this change the outcome?
[Figure: systems A and B use the training data and rank only the target items taken from the set of all items (training, unrated, test); metrics are computed on the test data to decide whether A > B.]

Target items
[Figure: items are labeled as training, unrated, or test (liked / not liked); the target items are selected among them.]

Target items
• Test + all unrated: all items except training items (largest target set)
• Test + no unrated: just test ratings (smallest target set)

Target items
• Test + all unrated
• Test + some unrated
• Test + no unrated
May the number of unrated target items affect the evaluation outcome?
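The three target-set options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_target_items`, its parameters, and the uniform sampling of unrated items are hypothetical choices, not something the slides specify.

```python
import random

def build_target_items(all_items, training_items, test_items, n_unrated, rng):
    """Target set for one user: the test items plus a sample of unrated items.

    Hypothetical helper; uniform sampling is an assumption of this sketch.
    """
    # Unrated items have neither a training nor a test rating for this user.
    unrated = sorted(set(all_items) - set(training_items) - set(test_items))
    if n_unrated is None:
        # "Test + all unrated": every item except the training items.
        sampled = unrated
    else:
        # n_unrated = 0 gives "Test + no unrated" (just the test ratings).
        sampled = rng.sample(unrated, min(n_unrated, len(unrated)))
    return set(test_items) | set(sampled)

rng = random.Random(0)
all_items = range(10)
training, test = {0, 1}, {2, 3}

smallest = build_target_items(all_items, training, test, 0, rng)
some = build_target_items(all_items, training, test, 3, rng)
largest = build_target_items(all_items, training, test, None, rng)
```

Training items never enter the target set, and the test items are always included, so only the number of unrated items varies between configurations.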

Result inconsistency
[Figure: toy example in which rankings A and B are evaluated with all unrated items and with no unrated items as targets (items are test liked, test not liked, or unrated). The P@2 values shown are 0, 0.5, 1, and 0.5, and the comparison between A and B flips (> vs. <) between the two target sets.]
May this affect the evaluation outcome?
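A small Python sketch can make the flip concrete. The item names and the two rankings below are hypothetical, made up for illustration rather than copied from the slide, and unrated items are assumed to count as non-relevant when computing P@2.

```python
def precision_at_k(ranking, liked, targets, k):
    """P@k of a ranking after restricting it to the target items.

    Assumption of this sketch: any item not in `liked` is non-relevant.
    """
    filtered = [item for item in ranking if item in targets]
    return sum(item in liked for item in filtered[:k]) / k

# Hypothetical data: a, b, c have test ratings (a and b liked); u1, u2 are unrated.
liked = {"a", "b"}
test_items = {"a", "b", "c"}
all_targets = test_items | {"u1", "u2"}

ranking_a = ["a", "u1", "u2", "c", "b"]
ranking_b = ["u1", "u2", "a", "b", "c"]

# All unrated items as targets: A beats B.
full_a = precision_at_k(ranking_a, liked, all_targets, 2)
full_b = precision_at_k(ranking_b, liked, all_targets, 2)

# No unrated items as targets: B beats A.
test_a = precision_at_k(ranking_a, liked, test_items, 2)
test_b = precision_at_k(ranking_b, liked, test_items, 2)
```

The underlying rankings never change; only the filter applied before the cutoff does, and that alone reverses which system looks better.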

A simple offline experiment on MovieLens 1M
8 systems:
• iMF (full)
• iMF (test)
• kNN (full/test)
• Normalized kNN (full)
• Normalized kNN (test)
• Average rating
• Popularity
• Random
[Figure: P@10 of each system (y-axis, 0 to 0.6) under the Full (all unrated) and Test (no unrated) target sets.]

[Figure: system rankings by P@10, from position 1 (best system) to position 8 (worst system), with all unrated vs. no unrated target items. Systems labeled: iMF (full), iMF (test), kNN (full/test), Average rating, Most popular, Random.]

[Figure: system rankings from best to worst with all unrated vs. no unrated target items.]
Kendall τ = 0.14 between the two system rankings
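For readers who want to reproduce this kind of comparison, here is a minimal sketch of Kendall's τ between two system rankings. The slide does not say which τ variant was used, so the tau-a form below is an assumption, and the system names and P@10 scores are hypothetical placeholders, not the experiment's results.

```python
from itertools import combinations

def kendall_tau(scores_x, scores_y):
    """Kendall tau-a between two score dicts keyed by system name."""
    systems = sorted(scores_x)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        # A pair is concordant when both evaluations order it the same way.
        sign = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical P@10 scores of four systems under the two target sets.
p10_all_unrated = {"s1": 0.30, "s2": 0.20, "s3": 0.10, "s4": 0.05}
p10_no_unrated = {"s1": 0.40, "s2": 0.60, "s3": 0.70, "s4": 0.50}

tau = kendall_tau(p10_all_unrated, p10_no_unrated)
```

A value near 1 means both target sets rank the systems alike, a value near 0 means the two rankings are close to unrelated, and negative values mean they tend to disagree.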

Biased disagreement
Systematic disagreements by number of unrated target items:
• Many: popular items; unrated items in objective function
• Few: items with high average rating; ignoring unrated items in objective

Biased disagreement
Which one is right?

Biased disagreement – Is either one right?
Which one is right? Since we want to match online evaluation, let’s compare to unbiased evaluation.
[Figure: evaluation with few unrated items and evaluation with many unrated items, each compared to unbiased evaluation.]
Yahoo! R3
• MAR ratings → Unbiased evaluation
• MNAR ratings → Biased evaluation

Comparison to unbiased evaluation
Biased vs. unbiased evaluation with Yahoo! R3
[Figure: system rankings (positions 1 to 8) with no unrated and with all unrated target items, each compared to the unbiased ranking; the Kendall correlations shown are τ = 0.79 and τ = 0.57.]
Neither all nor zero unrated targets match unbiased evaluation well
How about something in between?
Let’s explore the target size range…

Comparison to unbiased evaluation – Correlation
Yahoo! R3
[Figure: Kendall τ against the unbiased system ranking (y-axis, 0 to 1) as a function of # unrated target items (x-axis: 0, 1, 2, 5, 10, 20, 50, 100, 200, 500, All), with a “sweet spot” marked. The no-unrated and all-unrated rankings are shown next to the unbiased one, with τ = 0.79 and τ = 0.57.]

Comparison to unbiased evaluation – Correlation
[Figure: two panels, Yahoo! R3 and MovieLens 1M, with # unrated target items on the x-axis (0 to 500 and All for Yahoo! R3; 0 to 2000 and All for MovieLens 1M). The Yahoo! R3 panel shows Kendall τ (0 to 1) with a “sweet spot”; the MovieLens 1M panel shows “?”.]
MovieLens 1M: no MAR data → check discriminative power

Comparison to unbiased evaluation – Discriminative power: ties
Check discriminative power
[Figure: two panels, Yahoo! R3 and MovieLens 1M, plotting # ties against # unrated target items; the Yahoo! R3 panel also shows Kendall τ and its “sweet spot”.]
• Yahoo! R3: almost opposite monotonicity between # ties and Kendall τ
• MovieLens 1M: sweet spot?

Comparison to unbiased evaluation – Discriminative power: ties
Why do ties increase in the extremes?
• Few unrated items: small set of items to rank
• Many unrated items: metric → 0 as # unrated → ∞
[Figure: # ties and Kendall τ vs. # unrated target items for Yahoo! R3 (“sweet spot”, almost opposite monotonicity) and # ties for MovieLens 1M (“Sweet spot?”), with the regions of many ties marked.]
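The tie counts discussed on this slide can be approximated with a short Python sketch. The slides do not define exactly how ties are counted, so this assumes a tie is a (user, pair of systems) combination where both systems obtain exactly the same metric value; the function name and the per-user P@10 values are hypothetical.

```python
from itertools import combinations

def count_ties(metric_by_system):
    """Count (user, system pair) combinations with exactly equal metric values.

    Assumed definition of a tie; `metric_by_system` maps system -> {user: value}.
    """
    ties = 0
    for a, b in combinations(sorted(metric_by_system), 2):
        for user, value in metric_by_system[a].items():
            if metric_by_system[b].get(user) == value:
                ties += 1
    return ties

# Hypothetical per-user P@10 with a very large target set: values collapse to 0.
many_unrated = {
    "A": {"u1": 0.0, "u2": 0.0, "u3": 0.1},
    "B": {"u1": 0.0, "u2": 0.0, "u3": 0.0},
    "C": {"u1": 0.0, "u2": 0.1, "u3": 0.0},
}
# Hypothetical per-user P@10 with a mid-sized target set: values spread out.
mid_sized = {
    "A": {"u1": 0.3, "u2": 0.2, "u3": 0.5},
    "B": {"u1": 0.1, "u2": 0.2, "u3": 0.4},
    "C": {"u1": 0.2, "u2": 0.0, "u3": 0.1},
}

ties_many = count_ties(many_unrated)
ties_mid = count_ties(mid_sized)
```

With these made-up numbers the large target set produces more ties, in line with the slide's argument that metric values collapsing to 0 leave fewer ways to tell systems apart.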

Comparison to unbiased evaluation – Discriminative power: 𝑝-values
[Figure: two panels, Yahoo! R3 (with its “sweet spot”) and MovieLens 1M (“Sweet spot?”), plotting the sum of 𝑝-values and # ties against # unrated target items; the Yahoo! R3 panel also shows Kendall τ.]
The number of ties seems more informative than 𝑝-values

Loss of coverage
Small target sets can easily cause incomplete rankings
Risk of highly misleading results depending on how the metric deals with this
[Figure: with no unrated target items a ranking can end before the metric cutoff (coverage loss), whereas with all unrated items it extends past the cutoff.]
[Figure: Coverage@10 (y-axis, 0 to 1) vs. # unrated target items for Yahoo! R3 and MovieLens 1M; kNN with small 𝑘 is highlighted in both panels.]
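As a rough illustration of the coverage loss described on this slide, the sketch below computes a coverage-at-cutoff figure. The slides do not give the formula behind Coverage@10, so the definition used here (the average fraction of the top-k positions a system manages to fill per user) is an assumption, and the function name and data are hypothetical.

```python
def coverage_at_k(rankings_by_user, k):
    """Average fraction of the top-k positions that a system actually fills.

    Assumed definition for illustration; `rankings_by_user` maps user -> ranked list.
    """
    if not rankings_by_user:
        return 0.0
    filled = [min(len(ranking), k) / k for ranking in rankings_by_user.values()]
    return sum(filled) / len(filled)

# Hypothetical output of a system that cannot score every target item,
# e.g. a kNN recommender with few neighbors evaluated on a small target set.
rankings = {
    "u1": list(range(10)),  # complete top 10
    "u2": list(range(4)),   # only 4 of the target items could be ranked
}

coverage = coverage_at_k(rankings, 10)
```

How a metric treats the empty positions (as misses, or by shortening the cutoff) changes the reported value, which is the risk of misleading results the slide points to.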

Conclusion
• Different target sets produce different evaluation outcomes
– The disagreements are systematic on specific algorithms and configurations
• Weakness of small target sets
– More difficult to produce different rankings → discrimination power loss
– Incomplete rankings
• Weakness of large target sets
– Exposure to observation bias (popularity or any other MNAR bias)
– More difficult to produce metric values > 0 → discrimination power loss
• Tie analysis can provide helpful orientation
Neither seems ideal!
Sweet spot → balance

Future work
• Target items introduce a pre-filter that may alter the evaluated algorithms
– Different target sampling distributions (e.g. popularity)
– Different split protocols (e.g. temporal) also affect this
• Further research on offline evaluation bias
– Does unbiased Yahoo! R3 match a real setting?
• Also check out:
– Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
– Li et al., On Sampling Top-K Recommendation Evaluation, KDD 2020
