The Science and the Magic of User Feedback for Recommender Systems

Xavier Amatriain

Bay Area, March '11

But first...

About Telefonica and Telefonica R&D

Telefonica is a fast-growing Telecom

Year: 1989 | 2000 | 2008
Clients: About 12 million subscribers | About 68 million customers | About 260 million customers
Services: Basic telephone and data services | Wireline and mobile voice, data and Internet services | Integrated ICT solutions for all customers
Geographies: Operations in Spain | Operations in 16 countries | Operations in 25 countries
Staff: About 71,000 professionals | About 149,000 professionals | About 257,000 professionals
Finances: Rev: 4,273 M€, EPS(1): 0.45 € | Rev: 28,485 M€, EPS(1): 0.67 € | Rev: 57,946 M€, EPS: 1.63 €
(1) EPS: Earnings per share

Currently among the largest in the world
Telco sector worldwide ranking by market cap (US$ bn)

Source: Bloomberg, 06/12/09

Just announced 2010 results: record net earnings,
first Spanish company ever to make > 10B €

Leader in South America

Data as of March ‘09

Wireline market rank, mobile market rank
1 2 Argentina: 20.9 million
2 1 Brazil: 61.4 million
2 Central America: 6.1 million
1 2 Colombia: 12.6 million
1 1 Chile: 10.1 million
2 Ecuador: 3.3 million
2 Mexico: 15.7 million
1 1 Peru: 15.2 million
1
Uruguay: 1.5 million
2 Venezuela: 12.0 million
Total Accesses (as of March ‘09)
159.5 million

Notes:
- Central America includes Guatemala, Panama, El Salvador and Nicaragua
- Total accesses figure includes Narrowband Internet accesses of Terra Brasil and Terra Colombia, and Broadband
Internet accesses of Terra Brasil, Telefónica de Argentina, Terra Guatemala and Terra México.

And a significant footprint in Europe

Wireline market rank
Mobile market rank
Data as of March ‘09

1 1 Spain: 47.2 million
1 UK: 20.8 million
4 Germany: 16.0 million
2
Ireland: 1.7 million
Czech Republic: 7.7 million
1 2
Slovakia: 0.4 million
3

Total Accesses (as of March ’09)
93.8 million

Scientific Research
● Mobile and Ubicomp
● Multimedia Core
● User Modelling & Data Mining
● HCIR
● Wireless Systems
● Content Distribution & P2P
● Social Networks

Projects
● Recommendation Algorithms
● User Analysis & Modeling
● Tourist routes
● Social contacts
● Music
● Movies
● Contextual
● The Wisdom of the Few
● Noise in users’ ratings
● Mobile Tourist behavior
● Microprofiles
● Implicit user feedback
● Multiverse Tensor Factorization
● IPTV viewing habits

And about the world we live in...

More is Less
Worse Decisions
Less Decisions

Analysis Paralysis is making
headlines

Search engines don’t always hold the answer

What about information to help take decisions?

The Age of Search has come to
an end

... long live the Age of Recommendation!
● Chris Anderson in “The Long Tail”
– “We are leaving the age of information and entering the age of
recommendation”
● CNN Money, “The race to create a 'smart' Google”:
– “The Web, they say, is leaving the era of search and entering
one of discovery. What's the difference? Search is what you do
when you're looking for something. Discovery is when
something wonderful that you didn't know existed, or didn't
know how to ask for, finds you.”

Recommender
Systems
Recommendations

Read this

Attend this conference

Data mining +
all those other things
● User Interface
● User modeling
● System requirements (efficiency, scalability,
privacy....)
● Business Logic
● Serendipity
● ....

Approaches to
Recommendation
● Collaborative Filtering
– Recommend items based only on the user's past behavior
● Content-based
– Recommend based on features inherent to the items
● Social recommendations (trust-based)

What works

● It depends on the domain and particular problem
● As a general rule, it is usually a good idea to combine approaches:
Hybrid Recommender Systems
● However, in the general case it has been
demonstrated that (currently) the best isolated
approach is CF.
– Item-based CF is in general more efficient and better, but
mixing CF approaches can improve results
● Other approaches can improve results in specific
cases (cold-start problem...)

The CF Ingredients

● List of m Users and a list of n Items
● Each user has a list of items with associated opinion
– Explicit opinion: a rating score (numerical scale)
– Implicit feedback: purchase records or listening history
● Active user for whom the prediction task is performed
● A metric for measuring similarity between users
● A method for selecting a subset of neighbors
● A method for predicting a rating for items not rated by the active user.
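
The ingredients above can be put together in a few lines. The following is a minimal sketch of user-based kNN collaborative filtering, not the system discussed in this talk: the toy ratings, the choice of Pearson correlation as the similarity metric, and the names `pearson` and `predict` are all assumptions made for illustration.

```python
from math import sqrt

# Toy explicit ratings: user -> {item: rating on a 1-5 scale} (made-up data).
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(u, v):
    """Similarity metric: Pearson correlation over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu = mean(ratings[u][i] for i in common)
    mv = mean(ratings[v][i] for i in common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den_u = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    den_v = sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0


def predict(active, item, k=2):
    """Predict the active user's rating for an item they have not rated."""
    # Neighbor selection: the k most similar users who rated the item
    # and who correlate positively with the active user.
    candidates = [(pearson(active, v), v) for v in ratings
                  if v != active and item in ratings[v]]
    neighbors = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    base = mean(ratings[active].values())
    if not neighbors:
        return base  # fall back to the active user's own average
    # Prediction: the user's mean plus the similarity-weighted mean of the
    # neighbors' deviations from their own means.
    num = sum(s * (ratings[v][item] - mean(ratings[v].values()))
              for s, v in neighbors)
    den = sum(s for s, _ in neighbors)
    return base + num / den


prediction = predict("ann", "D")
```

Here "cara" correlates negatively with "ann" and is dropped, so the prediction for item D rests on "bob" alone. A production system would also clamp predictions to the rating scale and handle much sparser data.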

The Netflix Prize

● 500K users x 17K movie
titles = 100M ratings = $1M
(if you “only” improve
existing system by 10%!
From 0.95 to 0.85 RMSE)
● 49K contestants on 40K teams from
184 countries.
● 41K valid submissions from 5K
teams; 64 submissions per day
● Winning approach uses hundreds of
predictors from several teams

User Feedback is Noisy

DID YOU HEAR WHAT
I LIKE??!!

...and limits Our Prediction
Accuracy

The Magic Barrier

● Magic Barrier = Limit on prediction accuracy
due to noise in original data
● Natural Noise = involuntary noise introduced by
users when giving feedback
● Due to (a) mistakes, and (b) lack of resolution in
personal rating scale
● Magic Barrier >= Natural Noise Threshold
● Our prediction error cannot be smaller than the
error in the original data
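
A small simulation makes the barrier concrete. This is a sketch under stated assumptions, not a result from the talk: true opinions are drawn uniformly on the 1 to 5 scale, natural noise is modeled as Gaussian with a made-up standard deviation `NOISE_STD`, and ratings are neither rounded nor clipped.

```python
import random
from math import sqrt

random.seed(0)


def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                / len(observed))


NOISE_STD = 0.5  # assumed size of the natural noise, in stars

# Hypothetical "true" opinions and the noisy ratings users actually give.
true_opinions = [random.uniform(1, 5) for _ in range(20000)]
observed_ratings = [t + random.gauss(0, NOISE_STD) for t in true_opinions]

# An oracle that knows every true opinion is still scored against the
# noisy ratings, so its error cannot drop below the noise level.
oracle_rmse = rmse(true_opinions, observed_ratings)
```

Even this perfect predictor ends up with an RMSE close to `NOISE_STD`, which is the sense in which the noise in the data bounds the achievable accuracy.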

Our related research questions
X. Amatriain, J.M. Pujol, N. Oliver (2009) "I like It... I like It Not: Measuring Users
Ratings Noise in Recommender Systems", in UMAP 09

● Q1. Are users inconsistent when providing
explicit feedback to Recommender Systems via
the common Rating procedure?
● Q2. How large is the prediction error due to
these inconsistencies?
● Q3. What factors affect user inconsistencies?

Experimental Setup

● 100 Movies selected from Netflix dataset doing
a stratified random sampling on popularity
● Ratings on a 1 to 5 star scale
● Special “not seen” symbol.
● Trials 1 and 3 = random order; trial 2 = ordered
by popularity


● Users are inconsistent
● Inconsistencies are not
random and depend on
many factors
● More inconsistencies for mild
opinions
● More inconsistencies for
negative opinions

Users' ratings are far from
ground truth

Trials   #Ti   #Tj   #∩    #∪    RMSE∩  RMSE∪
T1, T2   2185  1961  1838  2308  0.573  0.707
T1, T3   2185  1909  1774  2320  0.637  0.765
T2, T3   1961  1909  1730  2140  0.557  0.694

In a pairwise comparison between trials, RMSE is already > 0.55 (over the
intersection) or > 0.69 (over the union). The Netflix Prize target
was to get below 0.85 !!!
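
As an illustration of how such a pairwise comparison can be computed, here is a minimal sketch restricted to the <user, item> pairs rated in both trials. The dictionaries `t1` and `t2` hold made-up ratings, and `pairwise_rmse` is a hypothetical helper, not code from the study.

```python
from math import sqrt


def pairwise_rmse(trial_a, trial_b):
    """RMSE between two trials over the (user, item) pairs rated in both.

    Returns the RMSE and the number of pairs it was computed on.
    """
    common = set(trial_a) & set(trial_b)
    if not common:
        raise ValueError("the two trials share no rated (user, item) pairs")
    squared = sum((trial_a[key] - trial_b[key]) ** 2 for key in common)
    return sqrt(squared / len(common)), len(common)


# Made-up ratings keyed by (user, item); "not seen" items are simply absent.
t1 = {("u1", "m1"): 4, ("u1", "m2"): 2, ("u2", "m1"): 5, ("u2", "m3"): 3}
t2 = {("u1", "m1"): 3, ("u1", "m2"): 2, ("u2", "m1"): 5, ("u3", "m1"): 1}

value, n_common = pairwise_rmse(t1, t2)
```

Only three pairs appear in both toy trials, and a single one-star disagreement among them is enough to produce a non-zero RMSE.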

Algorithm Robustness to NN (Natural Noise)

Alg./Trial       T1      T2      T3      Tworst/Tbest
User Average     1.2011  1.1469  1.1945  4.7%
Item Average     1.0555  1.0361  1.0776  4%
User-based kNN   0.9990  0.9640  1.0171  5.5%
Item-based kNN   1.0429  1.0031  1.0417  4%
SVD              1.0244  0.9861  1.0285  4.3%

● RMSE for different Recommendation algorithms
when predicting each of the trials
● Trial 2 is consistently the least noisy

Rate it Again
X. Amatriain et al. (2009)"Rate it Again: Increasing Recommendation
Accuracy by User re-Rating", 2009 ACM RecSys
● Given that users are noisy… can we benefit from
asking to rate the same movie more than once?
● We propose an algorithm to allow for multiple ratings of
the same <user,item> tuple.
● The algorithm is subjected to two fairness conditions:
– Algorithm should remove as few ratings as possible (i.e.
only when there is some certainty that the rating is only
adding noise)
– Algorithm should not make up new ratings but decide on
which of the existing ones are valid.

Re-rating Algorithm
• One source rerating case:

Examples:
{3, 1} →Ø
{4} →4
{3, 4} →3

(2 source)
{3, 4, 5} →3

• Given the following milding function:
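
The milding function and the one-source rule can be sketched as follows. This is an assumption reconstructed from the one-source examples on this slide ({3, 1} → Ø, {4} → 4, {3, 4} → 3), not the authors' published definition: the names `milder` and `denoise_one_source`, the midpoint of 3, and the `MAX_GAP` threshold of one star are all hypothetical. The two-source case is not covered.

```python
SCALE_MIDPOINT = 3  # middle of the 1-5 star scale
MAX_GAP = 1         # assumed: ratings further apart than this count as noise


def milder(a, b):
    """Return the rating closer to the middle of the scale (assumed rule)."""
    return a if abs(a - SCALE_MIDPOINT) <= abs(b - SCALE_MIDPOINT) else b


def denoise_one_source(original, rerating=None):
    """Combine a rating with an optional re-rating of the same <user, item>.

    Returns one of the existing ratings, or None when the pair is discarded.
    """
    if rerating is None:
        return original            # nothing to compare against: keep it
    if abs(original - rerating) > MAX_GAP:
        return None                # strong disagreement: drop the tuple
    return milder(original, rerating)


examples = [denoise_one_source(3, 1),
            denoise_one_source(4),
            denoise_one_source(3, 4)]
```

The sketch respects both fairness conditions: it only ever returns a rating the user actually gave, and it removes a tuple only when the two ratings clearly disagree.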

Results

● One-source re-rating (Denoised⊚Denoising)

                 T1⊚T2   ΔT1    T1⊚T3   ΔT1    T2⊚T3   ΔT2
User-based kNN   0.8861  11.3%  0.8960  10.3%  0.8984  6.8%
SVD              0.9121  11.0%  0.9274  9.5%   0.9159  7.1%

● Two-source re-rating (Denoising T1 with the other 2)

                 T1⊚(T2, T3)  ΔT1
User-based kNN   0.8647       13.4%
SVD              0.8800       14.1%

Rate it again

● By asking users to rate items again we can
remove noise in the dataset
● Improvements of up to 14% in accuracy!
● Because we don't want all users to re-rate all
items we design ways to do partial denoising
● Data-dependent: only denoise extreme ratings
● User-dependent: detect “noisy” users

Denoising only noisy users

● Improvement in RMSE when doing one-source re-rating, as a function of
the percentage of denoised ratings and users: selecting only noisy
users and extreme ratings

The value of a re-rating

Adding new ratings
increases performance
of the CF algorithm


But you are better off
doing re-rating than
new ratings !!


And much better if you
know which ratings to
re-rate!!

Let's recap

● Inconsistencies can depend on many things including
how the items are presented
● Inconsistencies produce natural noise
● Natural noise reduces our prediction accuracy
independently of the algorithm
● By asking (some) users to re-rate (some) items again
we can remove noise and improve accuracy
● Having users repeat existing ratings may have more
value than adding new ones

Crowds are not always wise

Conditions that are needed to guarantee the Wisdom in a Crowd:
● Diversity of opinion
● Independence
● Decentralization
● Aggregation

Crowds are not always wise

vs.

Who won?

The Wisdom of the Few
X. Amatriain et al. "The wisdom of the few: a collaborative filtering
approach based on expert opinions from the web", SIGIR '09

“It is really only experts
who can reliably account
for their reactions”

Expert-based CF
● expert = individual that we can trust to have produced
thoughtful, consistent and reliable evaluations (ratings) of
items in a given domain
● Expert-based Collaborative Filtering
● Find neighbors from a reduced set of experts instead of
regular users.
1. Identify domain experts with reliable ratings
2. For each user, compute “expert neighbors”
3. Compute recommendations similar to standard kNN CF
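
The three steps can be sketched as below. This is a simplified illustration rather than the method from the paper: the expert and user ratings are made up, cosine similarity over co-rated items is an assumed choice of metric, and the `min_similarity` and `min_experts` thresholds are hypothetical parameters.

```python
from math import sqrt

# Step 1 (assumed already done): a small, trusted set of expert ratings.
expert_ratings = {
    "critic_1": {"A": 5, "B": 4, "C": 2, "D": 5},
    "critic_2": {"A": 4, "B": 5, "C": 1, "D": 4},
    "critic_3": {"A": 1, "B": 2, "C": 5, "D": 1},
}
user_ratings = {"A": 5, "B": 5, "C": 1}  # the regular user's own ratings


def cosine(u, v):
    """Cosine similarity over the items rated in both profiles."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item, min_similarity=0.9, min_experts=2):
    """Predict a rating from expert neighbors only; None if too few qualify."""
    # Step 2: expert neighbors are experts similar enough to the user
    # who have also rated the target item.
    neighbors = []
    for profile in expert_ratings.values():
        sim = cosine(user, profile)
        if sim >= min_similarity and item in profile:
            neighbors.append((sim, profile))
    if len(neighbors) < min_experts:
        return None
    # Step 3: same weighted mean-offset prediction as in standard kNN CF.
    user_mean = sum(user.values()) / len(user)
    num = sum(sim * (p[item] - sum(p.values()) / len(p))
              for sim, p in neighbors)
    den = sum(sim for sim, _ in neighbors)
    return user_mean + num / den


prediction = predict(user_ratings, "D")
```

Because neighbors come only from the expert set, the user's own ratings never have to be compared with those of other users.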

User Study
● 57 participants, only 14.5 ratings/participant
● 50% of the users consider Expert-based CF to be
good or very good
● Expert-based CF: only algorithm with an average
rating over 3 (on a 0-4 scale)

Advantages of the Approach

● Noise
– Experts introduce less natural noise
● Malicious Ratings
– Dataset can be monitored to avoid shilling
● Data Sparsity
– Reduced set of domain experts can be motivated to rate items
● Cold Start problem
– Experts rate items as soon as they are available
● Scalability
– Dataset is several orders of magnitude smaller
● Privacy
– Recommendations can be computed locally

So...

● Can we generate meaningful and personalized
recommendations ensuring 100% privacy?
● YES!
● Can we have a recommendation algorithm that
is efficient enough to run on a phone?
● YES!
● Can we have a recommender system that
works even if there is only one user?
● YES!

Some implementations

● A distributed Music Recommendation engine

Some implementations (II)

● A geo-localized Mobile Movie Recommender
iPhone App

Geo-localized Expert Movie
Recommendations

Powered by...

Expert CF...

● Recreates the old paradigm of manually finding
your favorite experts in magazines but in a fully
automatic non-supervised manner.

What if we don't have ratings?

The fascinating world of implicit user feedback

Examples of implicit feedback:
● Movies you watched

● Links you visited

● Songs you listened to

● Items you bought

● ....

Main features of implicit
feedback
● Our starting hypotheses are different from those
in previous work:
1. Implicit feedback can contain negative feedback:
given the right granularity and diversity, low
feedback = negative feedback
2. The numerical value of implicit feedback can be
mapped to preference given the appropriate
mapping
3. Once we have a trustworthy mapping, we can
evaluate implicit feedback predictions the same way as
with explicit feedback.
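
One possible mapping, shown only as an illustration, quantizes each user's play counts into a 1 to 5 scale by rank within that user's own history, so that a low count relative to the rest of the profile maps to a low, negative value. The function `quantize_playcounts`, the five levels, and the sample play counts are assumptions; the mapping used in the study may differ.

```python
def quantize_playcounts(playcounts, levels=5):
    """Map one user's play counts to 1..levels by rank within their history."""
    # Sort albums from least to most played; ties keep insertion order.
    ordered = sorted(playcounts, key=playcounts.get)
    n = len(ordered)
    # Split the ranking into `levels` equally sized bins.
    return {album: 1 + (rank * levels) // n
            for rank, album in enumerate(ordered)}


# Made-up listening history: album -> number of plays.
plays = {"a": 2, "b": 40, "c": 7, "d": 120, "e": 15}
pseudo_ratings = quantize_playcounts(plays)
```

Ranking within a single user's history makes the output comparable across light and heavy listeners, which is one way to obtain the granularity that hypothesis 1 asks for.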

Our questions

● Q1. Is it possible to predict ratings a user would
give to items given their implicit feedback?
● Q2. Are there other variables that affect this
mapping?

Experimental setup

● Online user study on the music domain
● Users required to have a music profile on Last.fm
● Goal: Compare explicit ratings with their
listening history, taking into account a number of
controlled variables

Results. Do explicit ratings relate
to implicit feedback?

Almost perfect linear
relation between ratings
and quantized implicit
feedback

Results. Do explicit ratings relate
to implicit feedback?

Extreme ratings have clear
ascending/descending
trend, but mild ratings
respond more to changes
in one direction

Results. Do other variables affect?

Albums listened to more
recently tend to receive
more positive ratings

Results. Do other variables affect?

Contrary to our expectations,
global album popularity does
not affect ratings

Results. What about user
variables?

● We obtained many demographic (age, sex, location...)
and usage variables (hours of music per week,
concerts, music magazines, ways of buying music...)
in the study.
● We performed an ANOVA on the data to
understand which variables explained some of its
variance.
● Only one of the usage variables contributed significantly (Sig.
Value < 0.05): “Listening Style”, which encodes whether the
user listened preferably to tracks, full albums, or both.

Results. Regression
Analysis
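– Model 1: $r_{iu} = \beta_0 + \beta_1 \cdot if_{iu}$
– Model 2: $r_{iu} = \beta_0 + \beta_1 \cdot if_{iu} + \beta_2 \cdot re_{iu}$
– Model 3: $r_{iu} = \beta_0 + \beta_1 \cdot if_{iu} + \beta_2 \cdot re_{iu} + \beta_3 \cdot gp_i$
– Model 4: $r_{iu} = \beta_0 + \beta_1 \cdot if_{iu} + \beta_2 \cdot re_{iu} + \beta_3 \cdot if_{iu} \cdot re_{iu}$

where $r_{iu}$ is the rating of user u for item i, $if_{iu}$ the implicit feedback,
$re_{iu}$ the recentness and $gp_i$ the global popularity of the item.

Model  R2      F-value               p-value          β0     β1     β2     β3
1      0.125   F(1, 10120) = 1146    < 2.2 · 10^-16   2.726  0.499
2      0.1358  F(2, 10019) = 794.8   < 2.2 · 10^-16   2.491  0.484  0.133
3      0.1362  F(3, 10018) = 531.8   < 2.2 · 10^-16   2.435  0.486  0.134  0.0285
4      0.1368  F(3, 10018) = 534.7   < 2.2 · 10^-16   2.677  0.379  0.038  0.053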

All models meaningfully explain the data. Introducing “recentness”
improves R2 by close to 10% (from 0.125 to 0.1358), but “global popularity”
or the interaction between variables does not make much difference

Results. Predictive power
Model         RMSE (excluding non-rated items)
User Average  1.131
1             1.026
2             1.017
3             1.016
4             1.016

Error in predicting 20% of the ratings, having trained our
regression model on the other 80%
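
The evaluation protocol can be sketched as follows for Model 1. The data here is synthetic and generated only for illustration: it reuses the reported Model 1 coefficients, but the implicit feedback levels (1 to 3), the noise level, and the global-mean baseline (the slide uses a per-user average) are assumptions, so the resulting numbers are not those of the study.

```python
import random
from math import sqrt

random.seed(1)

# Synthetic (implicit feedback, rating) pairs, generated for illustration only.
data = []
for _ in range(5000):
    implicit = random.randint(1, 3)  # assumed quantized feedback level
    rating = 2.726 + 0.499 * implicit + random.gauss(0, 1.0)
    data.append((implicit, rating))

# Hold out 20% of the ratings and train on the other 80%.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_line(pairs):
    """Ordinary least squares for r = beta0 + beta1 * x (closed form)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    beta1 = sxy / sxx
    return mean_y - beta1 * mean_x, beta1


def rmse(predict, pairs):
    """RMSE of a prediction function over (x, rating) pairs."""
    return sqrt(sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs))


beta0, beta1 = fit_line(train)
train_mean = sum(y for _, y in train) / len(train)

model_rmse = rmse(lambda x: beta0 + beta1 * x, test)
baseline_rmse = rmse(lambda x: train_mean, test)  # global mean baseline
```

On this synthetic data the fitted line recovers a slope near the generating value and beats the constant baseline on the held-out 20%.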

Conclusions
● Recommender systems and similar applications
usually focus on having more data
● But... often it is not about having more data but rather
better data
● User feedback cannot always be treated as ground
truth and needs to be processed
● Crowds are not always wise and sometimes we are
better off using experts
● Implicit feedback represents a good alternative to
understand users, but the mapping is not trivial

Colleagues
● Josep M. Pujol and Nuria Oliver (Telefonica)
worked on Natural Noise and Wisdom of the
Few projects
● Nava Tintarev (Telefonica) worked on
Natural Noise

External Collaborators
● Neal Lathia (UCL, London), Haewook Ahn
(KAIST, Korea), Jaewook Ahn (Pittsburgh
Univ.), and Josep Bachs (UPF, Barcelona)
on Wisdom of the Few
● Denis Parra (Pittsburgh Univ.) worked on
implicit-explicit

Thanks!

Questions?

Xavier Amatriain
xar@tid.es
http://xavier.amatriain.net
http://technocalifornia.blogspot.com
@xamat
