1. 1
Learning to Personalize
Justin Basilico
Page Algorithms Engineering
September 19, 2014
@JustinBasilico
ATL 2014
2. 2
Interested in high-quality recommendations
Proxy question: accuracy in predicted ratings, measured by root mean squared error (RMSE; formula below)
Improve RMSE by 10% = $1 million prize!
Data size: 100M ratings (back then “almost massive”)
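For reference (not shown on the slide), RMSE over a test set T of (user, video) pairs with true ratings r_uv and predicted ratings r̂_uv is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,v)\in T} \left( r_{uv} - \hat{r}_{uv} \right)^2 }
```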
7. 7
Everything is a Recommendation
Rows
Ranking
Over 75% of what people watch comes from our recommendations
Recommendations are driven by Machine Learning
8. 8
Top Picks
Personalization awareness
Diversity
9. 9
Personalized genres
Genres focused on user interest
Derived from tag combinations
Provide context and evidence
How are they generated?
Implicit: Based on recent plays, ratings & other interactions (sketch below)
Explicit: Taste preferences
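A minimal sketch of the implicit path, not Netflix's actual algorithm: score candidate tag-combination genres by how much their defining tags overlap with the tags of videos the member recently played. The names `recent_plays`, `video_tags`, and `candidate_genres` are hypothetical inputs.

```python
from collections import Counter

def score_genres(recent_plays, video_tags, candidate_genres):
    """Score candidate tag-combination genres against a member's recent plays.

    recent_plays: list of recently watched video ids (hypothetical input)
    video_tags: dict video_id -> set of tags describing that video
    candidate_genres: dict genre_name -> set of tags defining the genre
    """
    # Count how often each tag appears across the member's recent plays.
    tag_counts = Counter(tag for vid in recent_plays for tag in video_tags.get(vid, ()))

    scores = {}
    for genre, tags in candidate_genres.items():
        # A genre scores highly when the member keeps interacting with its defining tags.
        scores[genre] = sum(tag_counts[t] for t in tags)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: a member who recently watched two gritty crime dramas
video_tags = {
    "v1": {"crime", "gritty", "drama"},
    "v2": {"crime", "thriller", "gritty"},
}
genres = {
    "Gritty Crime Dramas": {"gritty", "crime", "drama"},
    "Feel-good Comedies": {"comedy", "feel-good"},
}
print(score_genres(["v1", "v2"], video_tags, genres))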
10. 10
Similarity
Recommend videos similar to one you’ve liked (similarity sketch below)
“Because you watched” rows
Pivots
Video information page
In response to user actions (search, list add, …)
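A minimal video-video similarity sketch, assuming each video is represented by a feature or tag vector; cosine similarity is one common choice, not necessarily the one used here, and `features` and `because_you_watched` are hypothetical names.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def because_you_watched(seed_id, features, k=5):
    """Return the k videos most similar to the seed video.

    features: dict video_id -> np.ndarray feature vector (hypothetical input)
    """
    seed = features[seed_id]
    scored = [(vid, cosine_similarity(seed, vec))
              for vid, vec in features.items() if vid != seed_id]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]
```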
11. 11
Support for Recommendations
Behavioral Support
Social Support
16. 16
Rating Prediction
Based on the first-year Progress Prize
Top 2 algorithms:
Matrix Factorization (SVD++) (see the sketch below)
Restricted Boltzmann Machines (RBM)
Ensemble: linear blend
[Diagram: the rating matrix R (Users × Videos, 99% sparse) is approximated as U × V, with U of size Users × d and V of size d × Videos]
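A minimal sketch of plain matrix factorization trained with stochastic gradient descent; the slide references the richer SVD++ variant (biases, implicit feedback), which this omits. All parameter values are illustrative.

```python
import numpy as np

def factorize(ratings, n_users, n_videos, d=20, lr=0.01, reg=0.05, epochs=20):
    """Learn R ~ U @ V.T from observed (user, video, rating) triples via SGD."""
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, d))
    V = 0.1 * rng.standard_normal((n_videos, d))
    for _ in range(epochs):
        for u, v, r in ratings:
            pu, qv = U[u].copy(), V[v].copy()
            err = r - pu @ qv                    # error on this observed rating
            U[u] += lr * (err * qv - reg * pu)   # gradient steps with L2 regularization
            V[v] += lr * (err * pu - reg * qv)
    return U, V

# Example: train on a few observed ratings, then predict a missing one
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
U, V = factorize(ratings, n_users=2, n_videos=3)
print(U[0] @ V[2])  # predicted rating of user 0 for video 2
```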
17. 17
Ranking by ratings
[Example: a row of titles with average ratings 4.7, 4.6, 4.5, …]
Niche titles
High average ratings… by those who would watch it
23. 23
Page-level algorithmic challenge
10,000s of possible rows, but only 10-40 rows on the page
Variable number of possible videos per row (up to thousands)
1 personalized page per device
24. 24
Balancing a Personalized Page
Accurate vs. Diverse
Discovery vs. Continuation
Depth vs. Coverage
Freshness vs. Stability
Recommendations vs. Tasks
26. 26
Building a page algorithmically
Approaches
Template: Non-personalized layout
Row-independent: Greedy rank rows by f(r | u, c)
Stage-wise: Pick next rows by f(r | u, c, p1:n), given rows p1:n already placed (see the sketch below)
Page-wise: Total page fitness f(p | u, c)
Obey constraints
Certain rows may be required (Continue Watching and My List)
Filter, de-duplicate
Format for device
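A minimal sketch of the row-independent and stage-wise approaches, assuming a scoring function for f(r | u, c) and, for the stage-wise case, a similarity penalty against rows already on the page as one way to make the score depend on p1:n. The function names and the diversity penalty are hypothetical.

```python
def greedy_rows(candidates, score, n_rows):
    """Row-independent: rank candidate rows by f(r | u, c) and take the top n."""
    return sorted(candidates, key=score, reverse=True)[:n_rows]

def stagewise_rows(candidates, score, similarity, n_rows, diversity_weight=0.5):
    """Stage-wise: pick the next row by f(r | u, c, p1:n), here approximated as the
    base score minus a penalty for similarity to rows already placed on the page."""
    page, remaining = [], list(candidates)
    while remaining and len(page) < n_rows:
        def adjusted(row):
            penalty = max((similarity(row, placed) for placed in page), default=0.0)
            return score(row) - diversity_weight * penalty
        best = max(remaining, key=adjusted)
        page.append(best)
        remaining.remove(best)
    return page
```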
27. 27
Row Features
Quality of items
Features of items
Quality of evidence
User-row interactions
Item/row metadata
Recency
Item-row affinity
Row length
Position on page
Title
Diversity
Similarity
Freshness
…
28. 28
Page-level Metrics
How do you measure the quality of the homepage?
Ease of discovery, Diversity, Novelty, …
Challenges:
Position effects
Row-video generalization
2D versions of ranking quality metrics
Example: Recall @ row-by-column (sketch below)
[Plot: Recall as a function of row position (rows 0–30)]
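A minimal sketch of one 2D recall metric, assuming the page is a list of rows (each a list of video ids) and `relevant` is the set of videos the member actually went on to play; the exact definition behind the slide's plot may differ.

```python
def recall_at(page, relevant, max_row, max_col):
    """Fraction of relevant videos appearing within the first max_row rows
    and first max_col columns of the page."""
    if not relevant:
        return 0.0
    shown = {vid for row in page[:max_row] for vid in row[:max_col]}
    return len(shown & relevant) / len(relevant)

# Example: recall within the first 2 rows and 3 columns
page = [["a", "b", "c", "d"], ["e", "f", "g"], ["h", "i"]]
print(recall_at(page, relevant={"b", "g", "h"}, max_row=2, max_col=3))  # 2/3
```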
30. 30
Three levels of Learning Distribution/Parallelization
1. For each subset of the population (e.g. region)
Want independently trained and tuned models
2. For each combination of hyperparameters (sketch below)
Simple: Grid search
Better: Bayesian optimization using Gaussian processes
3. For each subset of the training data
Distribute over machines (e.g. ADMM)
Multi-core parallelism (e.g. Hogwild!)
Or… use GPUs
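A minimal sketch of level 2 using the simple option, grid search, where each hyperparameter combination is an independent training job that can run in parallel; the slides suggest Bayesian optimization with Gaussian processes as the better alternative. `train_and_evaluate` is a hypothetical stand-in for a full training run.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def train_and_evaluate(params):
    """Hypothetical stand-in: train a model with these hyperparameters and
    return its validation error (here just a placeholder objective)."""
    lr, reg = params["lr"], params["reg"]
    return (lr - 0.01) ** 2 + (reg - 0.1) ** 2

def grid_search(grid, max_workers=4):
    """Level 2: each hyperparameter combination is an independent job,
    so jobs can be spread over cores or machines."""
    combos = [dict(zip(grid, values)) for values in product(*grid.values())]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(train_and_evaluate, combos))
    return min(zip(errors, combos), key=lambda pair: pair[0])

if __name__ == "__main__":
    grid = {"lr": [0.001, 0.01, 0.1], "reg": [0.01, 0.1, 1.0]}
    print(grid_search(grid))
```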
31. 31
Example: Training Neural Networks
Level 1: Machines in different AWS regions
Level 2: Machines in the same AWS region
Spearmint or MOE for parameter optimization
Condor, StarCluster, Mesos, etc. for coordination
Level 3: Highly optimized, parallel CUDA code on GPUs