On Target Item Sampling in Offline Recommender System Evaluation
Rocío Cañamares and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es
14th ACM Conference on Recommender Systems (RecSys 2020)
Virtual Event, Brazil, September 2020
Offline evaluation
Is system A better than B?
Offline evaluation
[Diagram: the set of all items is split per user into training, test and unrated items; systems A and B are trained on the training data, rank the items, and metrics are computed on the test data to decide whether A > B]
Getting it right
• Correlate with the production setting / online evaluation
• Be consistent and comparable with other offline experiments
Offline evaluation – Target items
[Diagram: same pipeline, but each system ranks only a set of target items drawn from the set of all items, rather than every item]
• Exclude items with training data
• Include a certain number of non-relevant items (e.g. to reduce cost)
• Can this change the outcome?
Target items
[Diagram: per-user item sets — training, test (liked / not liked) and unrated items; the target items are drawn from the test and unrated items]
Target items
• Largest target set: test + all unrated — all items except training items
• Smallest target set: test + no unrated — just the test ratings
Target items
• In between: test + some unrated items
May the number of unrated target items affect the evaluation outcome?
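The "test + some unrated" protocol above can be sketched in code (a minimal sketch; the function and argument names are hypothetical, not from the talk):

```python
import random

def build_targets(test_items, rated_items, all_items, n_unrated, seed=0):
    """Target set for one user: all of their test items, plus a random
    sample of n_unrated items the user never rated.
    n_unrated=0 gives the smallest target set ("test + no unrated");
    a large n_unrated approaches the largest ("test + all unrated")."""
    unrated = sorted(set(all_items) - set(rated_items))
    rng = random.Random(seed)
    sample = rng.sample(unrated, min(n_unrated, len(unrated)))
    return set(test_items) | set(sample)
```

Training items (included in `rated_items`) are excluded by construction, matching the target-set definition on the previous slides.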
Result inconsistency
[Figure: the same two recommendation rankings, A and B, evaluated against two target sets built from the same test data (liked / not liked / unrated items)]
• All unrated items as targets: P@2(A) = 0.5 > P@2(B) = 0
• No unrated items as targets: P@2(A) = 0.5 < P@2(B) = 1
May this affect the evaluation outcome?
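The kind of P@2 flip illustrated above can be reproduced with a tiny precision computation (toy data; the item names and scores are made up for illustration, not the slide's exact figure):

```python
def precision_at_k(ranking, relevant, k):
    """P@k: fraction of the top-k ranked items that are liked test items."""
    return sum(1 for item in ranking[:k] if item in relevant) / k

def rank(scores, targets):
    """Rank only the target items, by descending score."""
    return sorted((i for i in scores if i in targets), key=scores.get, reverse=True)

relevant = {"r1", "r2"}                       # liked test items
test_only = {"r1", "r2", "n1"}                # test ratings (n1 = not liked)
all_targets = test_only | {"u1", "u2", "u3"}  # test + all unrated

scores_a = {"u1": 6, "r1": 5, "u2": 4, "n1": 3, "r2": 2, "u3": 1}
scores_b = {"u1": 6, "u2": 5, "u3": 4, "r1": 3, "r2": 2, "n1": 1}

# With all unrated targets A wins (0.5 vs 0.0); with no unrated B wins (0.5 vs 1.0)
pa_all = precision_at_k(rank(scores_a, all_targets), relevant, 2)
pb_all = precision_at_k(rank(scores_b, all_targets), relevant, 2)
pa_test = precision_at_k(rank(scores_a, test_only), relevant, 2)
pb_test = precision_at_k(rank(scores_b, test_only), relevant, 2)
```

The same two score functions, evaluated against two target sets, reverse the comparison — exactly the inconsistency this slide raises.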
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: P@10 of 8 systems under the "all unrated" and "no unrated" target conditions — iMF (full), iMF (test), kNN (full/test), Normalized kNN (full), Normalized kNN (test), Average rating, Most popular, Random]
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: the 8 systems ranked from best to worst by P@10 under each condition; the "all unrated" and "no unrated" orderings differ drastically]
Result inconsistency
A simple offline experiment on MovieLens 1M
[Figure: same system rankings under "all unrated" vs. "no unrated" targets; the agreement between the two orderings is only Kendall τ = 0.14]
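The Kendall τ between two system orderings can be computed directly (a self-contained sketch assuming no tied ranks; `scipy.stats.kendalltau` computes the same with tie handling):

```python
from itertools import combinations

def kendall_tau(order_x, order_y):
    """Kendall correlation between two orderings of the same systems:
    (concordant pairs - discordant pairs) / total pairs."""
    pos_x = {s: i for i, s in enumerate(order_x)}
    pos_y = {s: i for i, s in enumerate(order_y)}
    conc = disc = 0
    for a, b in combinations(order_x, 2):
        # A pair is concordant when both orderings place a and b the same way
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            conc += 1
        else:
            disc += 1
    total = len(order_x) * (len(order_x) - 1) / 2
    return (conc - disc) / total
```

τ ranges from 1 (identical orderings) to -1 (reversed); a value like 0.14 means the two target-set conditions order the eight systems almost independently.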
Biased disagreement
The disagreements are systematic, depending on the number of unrated target items:
• Many unrated targets favor popular items, and systems that include unrated items in their objective function
• Few unrated targets favor items with a high average rating, and systems that ignore unrated items in their objective
Biased disagreement
Both target-set extremes disagree systematically — which one is right?
Biased disagreement – Is either one right?
Since we want to match online evaluation, let's compare both settings (few vs. many unrated target items) to an unbiased evaluation.
Yahoo! R3:
• MAR ratings → unbiased evaluation
• MNAR ratings → biased evaluation
Comparison to unbiased evaluation
Biased vs. unbiased evaluation with Yahoo! R3
[Figure: system rankings compared to the unbiased ranking — no unrated targets: τ = 0.79; all unrated targets: τ = 0.57]
Neither all nor zero unrated targets matches the unbiased evaluation well.
How about something in between? Let's explore the target size range…
Comparison to unbiased evaluation – Correlation
Yahoo! R3
[Figure: Kendall τ with the unbiased ranking as a function of the number of unrated target items (0, 1, 2, 5, 10, 20, 50, 100, 200, 500, all); the correlation peaks at an intermediate "sweet spot"]
Comparison to unbiased evaluation – Correlation
[Figure: Kendall τ vs. number of unrated target items on Yahoo! R3 (sweet spot visible) and MovieLens 1M]
MovieLens 1M has no MAR data, hence no unbiased reference — is there a sweet spot here too?
→ Check discriminative power.
Comparison to unbiased evaluation – Discriminative power: ties
[Figure: number of tied system pairs vs. number of unrated target items on Yahoo! R3 and MovieLens 1M; the tie counts show almost opposite monotonicity to the Kendall τ curve, suggesting a sweet spot in between]
Comparison to unbiased evaluation – Discriminative power: ties
[Figure: same tie counts as before; many ties appear at both extremes of the target size range]
Why do ties increase at the extremes?
• Few unrated items: small set of items to rank
• Many unrated items: the metric → 0 as the number of unrated items → ∞
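One way to obtain tie counts like those in the plots is a paired significance test per system pair, counting as tied every pair whose per-user metric difference is not significant (a sketch using a paired permutation test; the choice of test and of α are assumptions, not taken from the talk):

```python
import random
from itertools import combinations

def paired_permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """Two-sided paired permutation test on per-user metric values:
    randomly flip the sign of each per-user difference and count how
    often the permuted total is at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = abs(sum(diffs))
    hits = sum(1 for _ in range(n_perm)
               if abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
               >= observed)
    return hits / n_perm

def count_ties(per_user_metric, alpha=0.05):
    """Number of system pairs whose difference is not significant at alpha."""
    return sum(1 for a, b in combinations(per_user_metric, 2)
               if paired_permutation_pvalue(per_user_metric[a],
                                            per_user_metric[b]) >= alpha)
```

Running `count_ties` over the per-user P@10 values of all systems, for each target size, yields a curve like the ones in the figure.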
Comparison to unbiased evaluation – Discriminative power: p-values
[Figure: sum of p-values and number of ties vs. number of unrated target items on Yahoo! R3 and MovieLens 1M]
The number of ties seems more informative than p-values.
Loss of coverage
Small target sets can easily cause incomplete rankings (fewer recommended items than the metric cutoff) — a risk of highly misleading results, depending on how the metric deals with this.
[Figure: Coverage@10 vs. number of unrated target items on Yahoo! R3 and MovieLens 1M; coverage drops sharply for small target sets, most markedly for kNN with small k]
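Coverage at a cutoff can be measured as the fraction of users for whom the system fills the full top-k list (a minimal sketch; the function name is hypothetical):

```python
def coverage_at_k(user_rankings, k):
    """Fraction of users whose recommendation list reaches the metric
    cutoff k; small target sets produce short lists, lowering coverage."""
    return sum(1 for r in user_rankings if len(r) >= k) / len(user_rankings)
```

A system that can only score a few of the target items (e.g. kNN with small k) returns short lists for many users, which shows up here as low Coverage@10.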
Conclusion
Different target sets produce different evaluation outcomes
– The disagreements are systematic for specific algorithms and configurations
Weaknesses of small target sets
– Harder to produce different rankings → loss of discriminative power
– Incomplete rankings
Weaknesses of large target sets
– Exposure to observation bias (popularity or any other MNAR bias)
– Harder to produce metric values > 0 → loss of discriminative power
Neither extreme seems ideal! Sweet spot → balance
Tie analysis can provide helpful orientation
Future work
Target items introduce a pre-filter that may alter the behavior of the evaluated algorithms
– Different target sampling distributions (e.g. popularity-based)
– Different split protocols (e.g. temporal) also affect this
Further research on offline evaluation bias
– Does the unbiased Yahoo! R3 evaluation match a real setting?
Also check out:
– Krichene & Rendle, On Sampled Metrics for Item Recommendation, KDD 2020
– Li et al., On Sampling Top-k Recommendation Evaluation, KDD 2020