Slides of the paper presentation at RecSys 2011.
Abstract: The Recommender Systems community is paying increasing attention to novelty and diversity as key qualities beyond accuracy in real recommendation scenarios. Despite the rise of interest and work on the topic in recent years, we find that a clear common methodological and conceptual ground for the evaluation of these dimensions is still to be consolidated. Different evaluation metrics have been reported in the literature, but the precise relation, distinction or equivalence between them has not been explicitly studied. Furthermore, the metrics reported so far miss important properties, such as taking into consideration the ranking of recommended items, or whether items are relevant or not, when assessing the novelty and diversity of recommendations.
We present a formal framework for the definition of novelty and diversity metrics that unifies and generalizes several state of the art metrics. We identify three essential ground concepts at the roots of novelty and diversity: choice, discovery and relevance, upon which the framework is built. Item rank and relevance are introduced through a probabilistic recommendation browsing model, building upon the same three basic concepts. Based on the combination of ground elements, and the assumptions of the browsing model, different metrics and variants unfold. We report experimental observations which validate and illustrate the properties of the proposed metrics.
ACM RecSys 2011 - Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems
1. 5th ACM International Conference on Recommender Systems – RecSys 2011
Rank and Relevance
in Novelty and Diversity Metrics
for Recommender Systems
Saúl Vargas and Pablo Castells
Universidad Autónoma de Madrid
http://ir.ii.uam.es
IRG
Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems
5th ACM International Conference on Recommender Systems (RecSys 2011)
IR Group @ UAM Chicago, IL, 23-27 October 2011
2. Beyond accuracy: novelty and diversity

You bought (or browsed) Revolver, so you are recommended: Rubber Soul, With The Beatles, Let It Be, Help!, Beatles for Sale, A Hard Day's Night, Sgt. Pepper's Lonely Hearts Club Band, Yellow Submarine, Magical Mystery Tour, The White Album, Abbey Road, 1967-1970 (Blue), 1962-1966 (Red), Past Masters Vol. 1, Past Masters Vol. 2, Please Please Me… more Beatles' albums (rather than, say, Dark Side of the Moon, Some Girls, or Bob Dylan).

The recommended items are:
– Very similar to each other
– Very similar to what the user has already seen
– Very widely known
3. Novelty and diversity in Recommender Systems
Algorithms to enhance novelty and diversity
Greedy optimization of objective functions (accuracy + diversity), promotion of long-tail items, etc.
(Ziegler 2005, Zhang 2008, Celma 2008)
Metrics and methodologies to measure and evaluate novelty and diversity:

Inverse popularity, mean self-information (Zhou 2010): recommend in the long tail (novelty)

    MSI(R) = (1/|R|) Σ_{i ∈ R} −log₂ p(i)

Intra-list diversity, average pairwise distance (Ziegler 2005, Zhang 2008) (diversity)

    ILD(R) = (2 / (|R| (|R|−1))) Σ_{i_k, i_l ∈ R, k<l} d(i_k, i_l)

Other: temporal diversity (Lathia 2010), diversity relative to other users and to other systems (Bellogín 2010), aggregate diversity (Adomavicius 2011), unexpectedness (Adamopoulos 2011), etc.
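The two classical metrics above can be sketched directly in Python. This is a minimal illustration, not code from the paper; the function names are ours, `p_i` maps each item to its popularity-based probability, and `dist` is any pairwise item distance in [0, 1].

```python
import math
from itertools import combinations

def msi(R, p_i):
    """Mean self-information: average -log2 of item popularity over the list R."""
    return sum(-math.log2(p_i[i]) for i in R) / len(R)

def ild(R, dist):
    """Intra-list diversity: average pairwise distance between items in R."""
    pairs = list(combinations(R, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```

Note that neither function looks at the order of `R` or at relevance, which is exactly the limitation discussed in the next slides.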
4. Some limitations

Metrics are insensitive to the order of recommended items: the same item set yields the same measured diversity/novelty regardless of ranking.

[Slide figure: two rankings R1 and R2 of the same items, with the diverse items at the top of R1 and at the bottom of R2, yet both receive the same score]
5. Some limitations

Accuracy and diversity/novelty are measured independently: if method A is better on accuracy and method B on diversity, which one is better overall?

[Slide figure: methods A and B compared on separate accuracy and diversity bar charts]
6. Our research goals

1. Further formalize recommendation novelty and diversity metrics based on a few basic fundamental principles
2. Build a unified metric framework where:
   – As many state-of-the-art novelty and diversity metrics as possible are related and generalized
   – New metrics can be defined
3. Enhance the novelty and diversity metrics with rank sensitivity and relevance awareness
7. Basic fundamental principles to build metrics upon

Our approach: define and formalize novelty and diversity metrics based on models of how users interact with items.

Three basic fundamental principles in user-item interaction:
– Discovery: an item is seen by a user
– Relevance: an item would be liked by (or useful for, etc.) a user
– Choice: an item is actually accepted (bought, consumed, etc.) by a user

Formalized as binary random variables seen, rel, choose, taking values in {true, false}.

Simplifying assumptions:
– seen and rel are mutually independent
– If a user sees an item that is relevant for her, she chooses it

    p(choose) = p(seen) · p(rel)
8. Proposed metric framework

Expected effective novelty of items when a user interacts with a ranked list R of recommended items in a context θ:

    m(R) = C Σ_{i ∈ R} p(choose|i,u,R) nov(i|θ)

Novelty is relative: item novelty depends on a context θ, i.e. on (what we know about) what someone has seen sometime somewhere:
– Someone: the target user, a set of users, all users…
– Sometime: a specific past time period, an ongoing session, "ever"…
– Somewhere: past recommendations, the current recommendation R, recommendations by other systems, "anywhere"…
– "What we know about that": the context of observation, i.e. the available observations
9. Metric framework components

    m(R) = C Σ_{i ∈ R} p(choose|i,u,R) nov(i|θ)

– Item novelty model: nov(i|θ)
– Choice model: p(choose|i,u,R)
10. Item novelty models

Item novelty model nov(i|θ)

Discovery-based (negative popularity):
– Popularity complement: nov(i|θ) = 1 − p(seen|i,θ)    [forced discovery]
– Self-information (surprisal): nov(i|θ) = −log₂ p(i|seen,θ)    [free discovery]

Distance-based (here θ represents a set of items):
– Expected item distance: nov(i|θ) = Σ_{j ∈ θ} p(j|choose,i,θ) d(i,j)
– Minimum item distance: nov(i|θ) = min_{j ∈ θ} d(i,j)
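The four item novelty models can be sketched as small Python functions. This is a minimal illustration under our own naming, not the paper's code: `p_seen` and `p_given_seen` are item-to-probability mappings, `theta` is the context set of items, and `dist` is a pairwise item distance.

```python
import math

def nov_popularity_complement(i, p_seen):
    """Forced discovery: novelty as the complement of p(seen|i, theta)."""
    return 1.0 - p_seen[i]

def nov_self_information(i, p_given_seen):
    """Free discovery: novelty as the surprisal -log2 p(i|seen, theta)."""
    return -math.log2(p_given_seen[i])

def nov_expected_distance(i, theta, p_choose_given, dist):
    """Expected distance of i to the context items, weighted by choice probability."""
    return sum(p_choose_given[j] * dist(i, j) for j in theta)

def nov_min_distance(i, theta, dist):
    """Minimum distance of i to any item in the context."""
    return min(dist(i, j) for j in theta)
```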
11. Metric framework components

    m(R) = C Σ_{i ∈ R} p(choose|i,u,R) nov(i|θ)

– Item novelty model: nov(i|θ)
– Choice model: p(choose|i,u,R)
12. Choice model

Choice model p(choose|i,u,R). Applying p(choose) = p(seen) · p(rel):

    p(choose|i,u,R) = p(seen|i,u,R) · p(rel|i,u)

where p(seen|i,u,R) is the browsing model and p(rel|i,u) is the relevance model, independent from R.
13. Browsing model

The browsing model p(seen|i_k,u,R) should decrease with the rank k of the item in R. It can be formalized as different probabilistic discount functions (see e.g. Carterette 2011).

In general, p(seen|i_k,u,R) = disc(k), for instance:
– disc(k) = p^(k−1): exponential decay, as in RBP (Moffat 2008)
– disc(k) = 1 / log₂(k+1): logarithmic, as in nDCG
– disc(k) = 1/k: Zipfian, as in MRR, MAP, etc.
– disc(k) = 1: no discount
– … many others …
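The discount functions listed above translate directly to Python (ranks are 1-based; the `p=0.85` default mirrors the setting used later in the experiments, but is otherwise our choice):

```python
import math

def disc_exponential(k, p=0.85):
    """Exponential decay, as in RBP (Moffat 2008)."""
    return p ** (k - 1)

def disc_log(k):
    """Logarithmic discount, as in nDCG."""
    return 1.0 / math.log2(k + 1)

def disc_zipf(k):
    """Zipfian discount, as in MRR, MAP, etc."""
    return 1.0 / k

def disc_none(k):
    """No discount: every rank is equally likely to be browsed."""
    return 1.0
```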
14. Wrapping up: resulting metric scheme

    m(R) = C Σ_{i_k ∈ R} disc(k) · p(rel|i_k,u) · nov(i_k|θ)
           (rank discount · item relevance · item novelty)

Normalization, to get the novelty ratio by the expected number of browsed items:

    C = 1 / Σ_{i_k ∈ R} disc(k)

where the denominator Σ_{i_k ∈ R} disc(k) is the expected browsing depth.
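The full metric scheme above fits in a few lines of Python. A minimal sketch with our own naming: `disc` is a rank discount function, `p_rel` maps an item to p(rel|i,u), and `nov` maps an item to its novelty nov(i|θ).

```python
def metric(R, disc, p_rel, nov):
    """m(R) = C * sum_k disc(k) * p(rel|i_k,u) * nov(i_k|theta),
    with C = 1 / sum_k disc(k), the expected-browsing-depth normalization."""
    C = 1.0 / sum(disc(k) for k in range(1, len(R) + 1))
    return C * sum(disc(k) * p_rel(i) * nov(i)
                   for k, i in enumerate(R, start=1))
```

With disc(k) = 1 and p(rel) = 1 this degenerates to a plain average of item novelties, which is how the framework recovers the classical rank- and relevance-blind metrics.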
15. Implementation

Ground model estimates, with θ = the observed interaction between all users and items in the system.

Discovery distributions can be estimated from rating data or access records:
– Forced discovery: p(seen|i,θ) ~ IUF, the ratio of users who have interacted with i
– Free discovery: p(i|seen,θ) ~ ICF, the ratio of interactions involving i

The relevance distribution p(rel|i,u) is estimated by a mapping from ratings to relevance (see the definition of ERR in Chapelle 2009).
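The two discovery estimates described above can be sketched from a plain list of (user, item) interaction records. The record layout and function name are our assumption for illustration:

```python
from collections import Counter

def estimate_discovery(interactions, n_users):
    """Estimate the two discovery distributions from (user, item) records:
    forced discovery p(seen|i) as the ratio of users who interacted with i,
    free discovery p(i|seen) as the ratio of interactions involving i."""
    users_per_item = {}
    for u, i in interactions:
        users_per_item.setdefault(i, set()).add(u)
    item_counts = Counter(i for _, i in interactions)
    total = sum(item_counts.values())
    p_seen = {i: len(us) / n_users for i, us in users_per_item.items()}
    p_given_seen = {i: c / total for i, c in item_counts.items()}
    return p_seen, p_given_seen
```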
16. Novelty and diversity metrics
Putting it all together
Some metric framework instantiations
17. Putting it all together: metric framework instantiations

Discovery-based metrics, with θ = the observed interaction between all users and items in the system.

Expected popularity complement (novelty):

    EPC(R) = C Σ_{i_k ∈ R} disc(k) p(rel|i_k,u) (1 − p(seen|i_k))

Expected free discovery (novelty):

    EFD(R) = C Σ_{i_k ∈ R} disc(k) p(rel|i_k,u) (−log₂ p(i_k|seen))

Without rank and relevance, EFD reduces to the mean self-information:

    MSI(R) = −(1/|R|) Σ_{i ∈ R} log₂ p(i|seen)
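A minimal Python sketch of EPC and EFD as instantiated above (our own naming; `p_seen` and `p_given_seen` are the estimated discovery distributions, `disc` and `p_rel` the rank discount and relevance models):

```python
import math

def epc(R, disc, p_rel, p_seen):
    """Expected popularity complement of a ranked list R."""
    C = 1.0 / sum(disc(k) for k in range(1, len(R) + 1))
    return C * sum(disc(k) * p_rel(i) * (1.0 - p_seen[i])
                   for k, i in enumerate(R, start=1))

def efd(R, disc, p_rel, p_given_seen):
    """Expected free discovery of a ranked list R."""
    C = 1.0 / sum(disc(k) for k in range(1, len(R) + 1))
    return C * sum(disc(k) * p_rel(i) * -math.log2(p_given_seen[i])
                   for k, i in enumerate(R, start=1))
```

With disc(k) = 1 and p(rel) = 1, `efd` coincides with MSI, as the slide notes.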
18. Putting it all together: metric framework instantiations

Distance-based metrics.

Expected profile distance (unexpectedness, user-specific), with θ = the observed interaction of the target user only:

    EPD(R) = C_u Σ_{i_k ∈ R} Σ_{j ∈ u} disc(k) p(rel|i_k,u) p(rel|j,u) d(i_k,j)

Expected intra-list diversity (diversity), with θ = the recommended items the target user can see in R:

    EILD(R) = C Σ_{i_k ∈ R} Σ_{i_l ∈ R, l≠k} disc(k) disc(l|k) p(rel|i_k,u) p(rel|i_l,u) d(i_k,i_l)

Without rank and relevance, EILD reduces to the intra-list diversity:

    ILD(R) = (2 / (|R| (|R|−1))) Σ_{i_k, i_l ∈ R, k<l} d(i_k,i_l)
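A sketch of EPD and EILD under our own naming and two explicit assumptions: the normalization constants are taken as the sums of the corresponding weights, and the relative discount disc(l|k) is taken as disc(max(1, l−k)), one plausible choice among those the framework admits.

```python
def epd(R, profile, disc, p_rel, p_rel_profile, dist):
    """Expected profile distance of R w.r.t. the target user's profile items."""
    C = 1.0 / (sum(disc(k) for k in range(1, len(R) + 1)) *
               sum(p_rel_profile(j) for j in profile))
    return C * sum(disc(k) * p_rel(i) * p_rel_profile(j) * dist(i, j)
                   for k, i in enumerate(R, start=1) for j in profile)

def eild(R, disc, p_rel, dist):
    """Expected intra-list diversity with a relative rank discount disc(l|k)."""
    n = len(R)
    total, norm = 0.0, 0.0
    for k in range(1, n + 1):
        for l in range(1, n + 1):
            if l == k:
                continue
            # Assumed relative discount: disc(l|k) = disc(max(1, l - k))
            w = disc(k) * disc(max(1, l - k))
            total += w * p_rel(R[k - 1]) * p_rel(R[l - 1]) * dist(R[k - 1], R[l - 1])
            norm += w
    return total / norm
```

With disc(k) = 1 and p(rel) = 1, `eild` coincides with ILD, matching the reduction stated on the slide.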
19. Novelty and diversity metrics
Some experiments
20. Experiments

Datasets:
– MovieLens 1M
– Last.fm data by Òscar Celma

Recommender algorithms:
– CB: content-based (MovieLens only)
– UB: user-based kNN
– MF: matrix factorization
– AVG: average rating
– RND: random

Experiment design:
– Run baseline recommenders
– Rerank the top 500 recommended items by diversification algorithms
– Measure metrics on the top 50 items

Metrics:
– EPC@50: novelty (popularity complement)
– EPD@50: unexpectedness (profile distance)
– EILD@50: intra-list diversity
– Distance function: complement of Jaccard (MovieLens genres) and Pearson (Last.fm)

Diversification algorithms:
– MMR: greedy optimization of relevance + diversity (Zhang 2008)
– IA-Select: adaptation of an IR diversity algorithm (Agrawal 2008)
– NGD: greedy optimization of relevance + novelty
– Random
21. Experimental results on baseline recommenders (no rank discount)

[Slide figure: bar charts of EPC@50, EPD@50 and EILD@50 for CB, MF, UB, AVG and RND on MovieLens 1M and Last.fm, without and with relevance]

Without relevance:
– CB is good for the long tail, not so good at unexpectedness and diversity
– AVG rating and RND stand out, especially on Last.fm

With relevance:
– MF stands out on MovieLens
– UB stands out on Last.fm
– AVG rating and RND drop drastically
22. Experimental results with diversification algorithms

Significance tested with Wilcoxon, p < 0.001. Each cell shows the metric value for disc(k) = 1 / disc(k) = 0.85^(k−1). (The original slide additionally highlights the best value per column, values above random reranking, and values below the MF baseline.)

MovieLens 1M, without relevance:
            EPC@50           EPD@50           EILD@50
MF          0.9124 / 0.8876  0.7632 / 0.7466  0.7164 / 0.6191
IA-Select   0.9045 / 0.8886  0.8080 / 0.7577  0.8289 / 0.7483
MMR         0.9063 / 0.8769  0.7605 / 0.7428  0.7191 / 0.6247
NGD         0.9851 / 0.9795  0.7725 / 0.7551  0.6563 / 0.5430
Random      0.9525 / 0.9527  0.7699 / 0.7699  0.7283 / 0.6719

Last.fm, without relevance:
            EPC@50           EPD@50           EILD@50
MF          0.8754 / 0.8481  0.8949 / 0.8895  0.8862 / 0.7954
IA-Select   0.8840 / 0.9089  0.8912 / 0.8909  0.8878 / 0.8274
MMR         0.9068 / 0.8903  0.9133 / 0.9107  0.9166 / 0.8398
NGD         0.9722 / 0.9571  0.9423 / 0.9398  0.9485 / 0.8784
Random      0.9359 / 0.9357  0.9278 / 0.9279  0.9318 / 0.8619

MovieLens 1M, with relevance:
            EPC@50           EPD@50           EILD@50
MF          0.0671 / 0.1043  0.0580 / 0.0944  0.0471 / 0.0551
IA-Select   0.0705 / 0.1161  0.0639 / 0.1032  0.0537 / 0.0648
MMR         0.0719 / 0.1131  0.0620 / 0.1020  0.0510 / 0.0610
NGD         0.0155 / 0.0223  0.0128 / 0.0200  0.0067 / 0.0017
Random      0.0222 / 0.0218  0.0182 / 0.0179  0.0117 / 0.0058

Last.fm, with relevance:
            EPC@50           EPD@50           EILD@50
MF          0.2501 / 0.2115  0.2671 / 0.2587  0.2518 / 0.1900
IA-Select   0.3343 / 0.4752  0.3462 / 0.3994  0.3343 / 0.4154
MMR         0.2351 / 0.1936  0.2439 / 0.2340  0.2360 / 0.1759
NGD         0.2286 / 0.3077  0.2212 / 0.2593  0.2165 / 0.2656
Random      0.1362 / 0.1368  0.1407 / 0.1405  0.1342 / 0.1113

– Improvement w.r.t. random reranking is clearer with relevance
– Rank sensitivity uncovers further improvements by the diversification algorithms
– Different metrics consistently appreciate different diversification algorithms
23. Experimental results

The metrics behave consistently:
– E.g. the content-based recommender scores high on novelty (long tail) but low on unexpectedness and diversity
– Diversified recommendations score higher than the baselines
– Different diversification strategies meet their specific targets

Relevance makes a large difference:
– Probe recommenders such as random and average rating score high without relevance and rank discount, and drop with relevance
– Same effect for random diversification

Rank sensitivity uncovers further improvements by diversification algorithms which would otherwise go unnoticed.
24. Conclusion

A general metric framework for recommendation novelty and diversity evaluation:
– Flexible and configurable, supporting a fair range of variants and configurations
– Key configuration components: item novelty models, context θ, rank and relevance
– Unifies and generalizes state-of-the-art metrics
– Further metrics can be unified by taking alternative θ: temporal novelty/diversity, inter-system diversity, inter-user diversity
– Provides rank sensitivity and relevance awareness (as an option)
– Provides a single metric assessing accuracy and diversity/novelty together

Further ongoing empirical testing, and a wide space for further exploration!