Fast ALS-Based Matrix Factorization for Recommender Systems
David Zibriczky
LAWA Workpackage Meeting
16th January, 2013
Problem setting
Item Recommendation
• Classical item recommendation problem (see Netflix)
• Explicit feedback (ratings)

[Figure: one user's ratings for The Matrix, The Matrix 2, Twilight and The Matrix 3: one known rating (5), the rest unknown (?)]
Collaborative Filtering (Explicit)
• Classical item recommendation problem (see Netflix)
• Explicit feedback (ratings)
• Collaborative Filtering
  • Based on other users

[Figure: ratings of several users for The Matrix, The Matrix 2, Twilight and The Matrix 3; the other users' ratings (mostly 4s and 5s) help estimate the active user's missing ratings (?)]
Collaborative Filtering (Implicit)
• Items are not only movies (live content, products, holidays, …)
• Implicit feedback (buy, view, …)
• Less information about preference

[Figure: implicit feedback events of a user on four items, with the unknown preferences marked ?]
Industrial motivation
• Keeping the response time low
• Up-to-date user models: the adaptation should be fast
• Items may change rapidly, so training time can become a bottleneck of live performance
• Increasing amount of data from a customer → increasing training time
• Limited resources
Model
Preference Matrix
• Matrix representation
• Implicit feedback: assume positive preference, value = 1
• How to estimate the unknown preferences?
• Sorting items by the estimates → item recommendation

R      Item1  Item2  Item3  Item4
User1    1      ?      ?      ?
User2    ?      ?      1      ?
User3    1      1      ?      ?
User4    ?      1      ?      1
Matrix Factorization
๐‘น = ๐‘ท๐‘ธ ๐‘ป
๐‘Ÿ ๐‘ข๐‘– = ๐’‘ ๐‘ข
๐‘‡ ๐’’๐‘–
๐‘น ๐‘ต๐’™๐‘ด: preference matrix
๐‘ท ๐‘ต๐’™๐‘ฒ: user feature matrix
๐‘ธ ๐‘ด๐’™๐‘ฒ: item feature matrix
๐‘ต: #users
๐‘ด: #items
๐‘ฒ: #features
๐‘ฒ โ‰ช ๐‘ด, ๐‘ฒ โ‰ช ๐‘ต
16th January, 20139 LAWA Workpackage Meeting
R Item1 Item2 Item3 โ€ฆ
User1
User2 ๐’“ ๐‘ข๐‘–
User3
โ€ฆ
P
๐’‘ ๐‘ข
๐‘‡
QT
๐’’๐‘–
๐’‘ ๐’– โ‰” ๐‘ท ๐’– ๐‘ป
๐’’๐’Š โ‰” ๐‘ธ ๐’Š ๐‘ป
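A minimal numpy sketch of the model above: the prediction r̂_ui = p_u^T q_i and a top-N recommendation by sorting the estimates, as described on the Preference Matrix slide. The toy sizes and random initialization are my own, purely for illustration.

```python
import numpy as np

# Toy sizes: N users, M items, K latent features with K << N, M
N, M, K = 4, 4, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(N, K))   # user feature matrix
Q = rng.normal(scale=0.1, size=(M, K))   # item feature matrix

def predict(u, i):
    """Estimated preference r_hat_ui = p_u^T q_i."""
    return P[u] @ Q[i]

def recommend(u, known_items, top_n=2):
    """Sort items by estimated preference and drop the ones already consumed."""
    scores = Q @ P[u]                      # all estimates for user u at once
    scores[list(known_items)] = -np.inf    # never re-recommend known items
    return np.argsort(-scores)[:top_n]
```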
Objective Function
Preference Matrix
R      Item1  Item2  Item3  Item4
User1    1
User2                  1
User3    1      1
User4           1             1
Preference Matrix
• Zero value for the unknown preferences (zero examples). In practice: many 0s, few 1s.

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1
Confidence Matrix
• Zero value for the unknown preferences (zero examples). In practice: many 0s, few 1s.
• c_ui: confidence of a known feedback (a constant, or a function of the context of the event)
• Zero examples matter less than positive examples, but they still matter.

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

C      Item1  Item2  Item3  Item4
User1   c11     1      1      1
User2    1      1     c23     1
User3   c31    c32     1      1
User4    1     c42     1     c44
Weighted Sum of Squared Errors
• Objective function:

  f(P, Q) = WSSE = Σ_(u,i) c_ui (r_ui − r̂_ui)²

• Find the P and Q that minimize f (a code sketch follows this slide).

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

C      Item1  Item2  Item3  Item4
User1   c11     1      1      1
User2    1      1     c23     1
User3   c31    c32     1      1
User4    1     c42     1     c44
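A minimal numpy sketch of the objective above, evaluated on the 4×4 example from the slides. The confidence value 10 for positive feedback is an arbitrary illustrative choice of mine, not taken from the slides.

```python
import numpy as np

def weighted_sse(P, Q, R, C):
    """f(P, Q) = sum over (u, i) of c_ui * (r_ui - r_hat_ui)^2, with r_hat = P Q^T."""
    R_hat = P @ Q.T
    return np.sum(C * (R - R_hat) ** 2)

# The 4x4 implicit preference matrix from the slides and one possible confidence matrix:
R = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1]], dtype=float)
C = np.where(R > 0, 10.0, 1.0)   # c_ui > 1 on known feedback, 1 on zero examples
```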
Optimizer
Optimizer – Alternating Least Squares
• Ridge Regression (alternate: with Q fixed, solve for every p_u; with P fixed, solve for every q_i)
  • p_u = (Q^T C_u Q)^(−1) Q^T C_u r_u
  • q_i = (P^T C_i P)^(−1) P^T C_i r_i
  where C_u = diag(c_u1, …, c_uM), r_u is the u-th row of R, and C_i, r_i are the item-side counterparts
  (a code sketch of this solve follows this slide)

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

Initial factors:
P                          Q^T
-0.2   0.6                  0.1  -0.4   0.8   0.6
 0.6   0.4                  0.6   0.7  -0.7  -0.2
 0.7   0.2
 0.5  -0.2
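A minimal numpy sketch of the naive alternating solve above, assuming dense R and C as on the slides. The optional ridge term lam·I is my own addition for numerical stability; it is not part of the slide's formula.

```python
import numpy as np

def ridge_solve(X, conf, target, lam=0.0):
    """p = (X^T C X + lam*I)^(-1) X^T C target, with C = diag(conf)."""
    K = X.shape[1]
    A = X.T @ (conf[:, None] * X) + lam * np.eye(K)
    b = X.T @ (conf * target)
    return np.linalg.solve(A, b)

def als_iteration(P, Q, R, C, lam=0.0):
    """One full sweep: recompute every p_u with Q fixed, then every q_i with P fixed."""
    for u in range(R.shape[0]):
        P[u] = ridge_solve(Q, C[u], R[u], lam)        # p_u = (Q^T C_u Q)^-1 Q^T C_u r_u
    for i in range(R.shape[1]):
        Q[i] = ridge_solve(P, C[:, i], R[:, i], lam)  # q_i = (P^T C_i P)^-1 P^T C_i r_i
    return P, Q
```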
Optimizer – Alternating Least Squares
• Ridge Regression
  • p_u = (Q^T C_u Q)^(−1) Q^T C_u r_u
  • q_i = (P^T C_i P)^(−1) P^T C_i r_i

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

After recomputing Q^T (P fixed):
P                          Q^T
-0.2   0.6                  0.3  -0.3   0.7   0.7
 0.6   0.4                  0.7   0.8  -0.5  -0.1
 0.7   0.2
 0.5  -0.2
Optimizer – Alternating Least Squares
• Ridge Regression
  • p_u = (Q^T C_u Q)^(−1) Q^T C_u r_u
  • q_i = (P^T C_i P)^(−1) P^T C_i r_i

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

After recomputing P (Q fixed):
P                          Q^T
-0.2   0.7                  0.3  -0.3   0.7   0.7
 0.6   0.5                  0.7   0.8  -0.5  -0.1
 0.8   0.2
 0.6  -0.2
Optimizer – Alternating Least Squares
• Complexity of the naive solution: O(IK²NM + IK³(N + M))
  (E: number of examples, I: number of iterations)
• Improvement (Hu, Koren, Volinsky):
  • Ridge Regression: p_u = (Q^T C_u Q)^(−1) Q^T C_u r_u
  • Q^T C_u Q = Q^T Q + Q^T (C_u − I) Q = COVQ0 + COVQ+; computing this naively is the costly O(IK²NM) part
  • COVQ0 is user independent, so it needs to be computed only once, at the start of each iteration
  • Computing COVQ+ involves only the #P(u)+ positive examples of the user, not all M items
    (#P(u)+: number of positive examples of user u)
  • Complexity: O(IK²E + IK³(N + M)) = O(IK²(E + K(N + M)))
  • Codename: IALS (a code sketch of the user step follows this slide)
• Complexity issues on large datasets:
  • If K is low, O(IK²E) is dominant
  • If K is high, O(IK³(N + M)) is dominant
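A minimal numpy sketch of the IALS user step described above, assuming 0/1 implicit feedback: Q^T Q is precomputed once per sweep and shared, and only the user's positive examples are touched. The optional ridge term lam·I is my addition, not shown on the slide.

```python
import numpy as np

def ials_user_step(Q, QtQ, pos_items, pos_conf, lam=0.0):
    """IALS user update: exploit Q^T C_u Q = Q^T Q + Q^T (C_u - I) Q.

    QtQ       : precomputed Q^T Q (COVQ0, user independent, built once per sweep)
    pos_items : indices of the user's positive examples (r_ui = 1)
    pos_conf  : their confidences c_ui (> 1); every zero example has confidence 1
    """
    K = Q.shape[1]
    Q_pos = Q[pos_items]                                       # (#P(u)+, K)
    A = QtQ + Q_pos.T @ ((pos_conf - 1.0)[:, None] * Q_pos)    # COVQ0 + COVQ+
    A += lam * np.eye(K)                                       # optional ridge term (my addition)
    b = Q_pos.T @ pos_conf                                     # Q^T C_u r_u: only positives contribute
    return np.linalg.solve(A, b)

# Per sweep: QtQ = Q.T @ Q is computed once, in O(K^2 M), and reused for all users.
```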
Problem: Complexity
Ridge Regression with Coordinate Descent
• One user (User1), K = 3 features, p_u not yet known

R      Item1  Item2  Item3  Item4
User1    1      0      0      0

Q^T
 0.9  -0.4   0.8   0.6
 0.6   0.7  -0.7  -0.2
-0.1  -0.4  -0.1   0.6

p_u:    ?      ?      ?
Ridge Regression with Coordinate Descent
• Initialize p_u with zero values

R      Item1  Item2  Item3  Item4
User1    1      0      0      0

Q^T
 0.9  -0.4   0.8   0.6
 0.6   0.7  -0.7  -0.2
-0.1  -0.4  -0.1   0.6

p_u:    0      0      0
Ridge Regression with Coordinate Descent
• Target vector: e_u = C_u (r_u − p_u Q^T)
• Optimize only one feature of p_u at a time:
  p_uk = Σ_{i=1..M} c_ui q_ik e_ui / Σ_{i=1..M} c_ui q_ik q_ik = SQE / SQQ
• Refresh the residuals after each feature update: e_ui := e_ui − Δp_uk c_ui q_ik
• Apply more iterations (a code sketch of this loop follows this slide)

R      Item1  Item2  Item3  Item4
User1    1      0      0      0

Q^T
 0.9  -0.4   0.8   0.6
 0.6   0.7  -0.7  -0.2
-0.1  -0.4  -0.1   0.6

p_u after successive single-feature updates (K = 3):
(0.51, 0, 0) → (0.51, 0.10, 0) → (0.51, 0.10, 0.08) → (0.47, 0.10, 0.08) → (0.46, 0.11, 0.07)
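A minimal numpy sketch of the per-user coordinate-descent loop above. As an assumption of mine about the bookkeeping, the residual is kept unweighted (e_ui = r_ui − p_u·q_i) and the confidences enter the SQE/SQQ sums, which keeps every single-feature update exact.

```python
import numpy as np

def cd_user_update(p_u, Q, c_u, r_u, n_passes=2):
    """Coordinate descent for one user: optimize one feature of p_u at a time.

    The residual e_ui = r_ui - p_u^T q_i is kept up to date, so each feature
    update costs O(M) instead of a full K x K ridge solve.
    """
    e = r_u - Q @ p_u                          # residuals over all items
    for _ in range(n_passes):                  # "apply more iterations"
        for k in range(Q.shape[1]):
            sqe = np.sum(c_u * Q[:, k] * e)    # SQE
            sqq = np.sum(c_u * Q[:, k] ** 2)   # SQQ
            delta = sqe / sqq                  # change of the k-th feature
            p_u[k] += delta
            e -= delta * Q[:, k]               # refresh the residuals
    return p_u
```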
Optimizer – Coordinate Descent
• Ridge Regression with Coordinate Descent over the whole matrix, alternating: with Q^T fixed, the rows of P are recomputed one user at a time; then with P fixed, the rows of Q^T are recomputed one item at a time; and so on.

R      Item1  Item2  Item3  Item4
User1    1      0      0      0
User2    0      0      1      0
User3    1      1      0      0
User4    0      1      0      1

Sweep 1: Q^T fixed at
 0.1   0.4   1.1   0.6
 0.6   0.7   1.5   1.0
P is filled in row by row, giving
 0.3  -0.1
 0.1  -0.5
-0.4   0.2
 0.5  -0.4

Sweep 2: P fixed, Q^T recomputed row by row, giving
 0.1   0.4  -0.1   0.2
 0.6   0.7   0.8   0.5

Sweep 3: Q^T fixed, P recomputed again, giving
 0.2  -0.1
 0.1  -0.4
-0.3   0.1
 0.5  -0.6
Optimizer – Coordinate Descent
• Complexity of the naive solution: O(IKNM)
• Ridge Regression with CD computes the features from the examples directly, so the covariance-precomputation trick of IALS cannot be applied here.
Optimizer – Coordinate Descent Improvement
• Synthetic examples (Pilászy, Zibriczky, Tikk)
• Solution of Ridge Regression with CD: p_uk = Σ_{i=1..M} c_ui q_ik e_ui / Σ_{i=1..M} c_ui q_ik q_ik = SQE / SQQ
• Calculate the statistics of a user who watched nothing (SQE0 and SQQ0)
• The solution is then computed incrementally: p_uk = SQE / SQQ = (SQE0 + SQE+) / (SQQ0 + SQQ+)   (M + #P(u)+ steps)
• Eigenvalue decomposition: Q^T Q = S Λ S^T = (Λ^(1/2) S^T)^T (Λ^(1/2) S^T) = G^T G, with G = Λ^(1/2) S^T
• The zero examples are compressed into K synthetic examples: Q (M×K) → G (K×K)
• SGG0 = SQQ0, but it takes only K steps to compute: p_uk = (SGE0 + SQE+) / (SGG0 + SQQ+)   (K + #P(u)+ steps)
• SGE0 is computed the same way as SQE0, but over the K synthetic examples instead of the M items.
• Complexity: O(IK(E + KM + KN)) = O(IK(E + K(M + N)))
  (a code sketch of the compression follows this slide)
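A minimal numpy sketch of the compression step above. The only property used is G^T G = Q^T Q, obtained here via eigendecomposition; the sanity checks on random data at the end are my illustration, not part of the slides.

```python
import numpy as np

def synthetic_examples(Q):
    """Compress the M zero examples into K synthetic examples G with G^T G = Q^T Q.

    Q^T Q = S Λ S^T (eigendecomposition of a symmetric PSD matrix) gives
    G = Λ^(1/2) S^T, a K x K matrix, so per-feature zero-example statistics
    can be accumulated in K steps instead of M.
    """
    eigvals, S = np.linalg.eigh(Q.T @ Q)
    return np.sqrt(np.clip(eigvals, 0.0, None))[:, None] * S.T

# Sanity check on random data (illustration only):
rng = np.random.default_rng(0)
Q = rng.normal(size=(1000, 10))        # M = 1000 items, K = 10 features
G = synthetic_examples(Q)              # 10 x 10
assert np.allclose(G.T @ G, Q.T @ Q)   # same normal-equation contribution
k = 3
assert np.isclose(np.sum(Q[:, k] ** 2), np.sum(G[:, k] ** 2))   # SQQ0 from K rows
```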
Optimizer – Coordinate Descent
• Complexity of the naive solution: O(IKNM)
• Ridge Regression with CD computes the features from the examples directly; the covariance precomputation of IALS cannot be applied here.
• Synthetic examples solve this. Codename: IALS1
• Complexity reduction (IALS → IALS1): O(IK²(E + K(N + M))) → O(IK(E + K(M + N)))
• IALS1 requires a higher K for the same accuracy as IALS.
Optimizer – Coordinate Descent
...does it work in practice?
Comparison
• Average Rank Position (ARP) on a subset of a proprietary implicit feedback dataset; lower is better.
• IALS1 offers better time-accuracy trade-offs, especially when K is large.

K       IALS ARP   IALS time (s)   IALS1 ARP   IALS1 time (s)
5       0.1903        153          0.1898         112
10      0.1578        254          0.1588         134
20      0.1427        644          0.1432         209
50      0.1334       2862          0.1344         525
100     0.1314      11441          0.1325        1361
250     0.1311      92944          0.1312        6651
500     N/A           N/A          0.1282       24697
1000    N/A           N/A          0.1242      104611

[Chart: ARP versus training time (s, log scale) for IALS and IALS1]
Conclusion
• Explicit feedback is rarely provided, or not at all.
• Implicit feedback is more general.
• Alternating Least Squares has complexity issues on large datasets.
• An efficient solution uses approximation and synthetic examples.
• IALS1 offers better time-accuracy trade-offs, especially when K is large.
• IALS is an approximation algorithm too, so why not make it even more approximate?
Other algorithms
Model – Tensor Factorization
• Different preferences during the day
• Time period 1: 06:00-14:00

[Table R1: user×item preference matrix observed in time period 1 — a few 1s, mostly unknown]
Model – Tensor Factorization
• Different preferences during the day
• Time period 2: 14:00-22:00

[Tables R1 and R2: per-period user×item preference matrices for time periods 1 and 2]
Model – Tensor Factorization
• Different preferences during the day
• Time period 3: 22:00-06:00

[Tables R1, R2 and R3: per-period user×item preference matrices for the three time periods]
Model – Tensor Factorization

[Figure: the per-period preference matrices R1, R2, R3 stacked into a tensor and factorized into P, Q^T and T]

R (N×M per time period, N×M×L overall): preference data
P (N×K): user feature matrix
Q (M×K): item feature matrix
T (L×K): time feature matrix
N: #users, M: #items, L: #time periods, K: #features

r̂_uit = Σ_k p_uk q_ik t_tk,   i.e.  R = P ∘ Q ∘ T
(a code sketch of this prediction follows this slide)
Comparison – ITALS vs. IALS
• Data sets: Netflix (ratings of 5), IPTV provider VOD rentals, grocery purchases
• Evaluation metric: Recall@20, Precision-Recall@20
• Number of features: 20

Test case (K = 20)    IALS    ITALS
Netflix Probe         0.087   0.097
Netflix Time Split    0.054   0.071
IPTV VOD 1 day        0.063   0.112
IPTV VOD 1 week       0.055   0.100
Grocery               0.065   0.103
Comparison – ITALS vs. IALS
Objective Function – Ranking-based objective function
• Ranking-based objective function approach:
  • r_ui − r_uj: difference in preference between items i and j
  • r̂_ui − r̂_uj: estimated difference in preference between items i and j
  • s_j: importance of item j in the objective function
• Model: Matrix Factorization
• Optimizer: Alternating Least Squares
• Name: RankALS (a code sketch of the objective follows this slide)

f(θ) = Σ_{u∈U} Σ_{i∈I} c_ui Σ_{j∈I} s_j [ (r_ui − r_uj) − (r̂_ui − r̂_uj) ]²
Comparison – RankIALS vs. IALS
Related Publications
• Alternating Least Squares with Coordinate Descent
  I. Pilászy, D. Zibriczky, D. Tikk: Fast ALS-based matrix factorization for explicit and implicit feedback datasets. RecSys 2010
• Tensor Factorization
  B. Hidasi, D. Tikk: Fast ALS-Based Tensor Factorization for Context-Aware Recommendation from Implicit Feedback. ECML PKDD 2012
• Personalized Ranking
  G. Takács, D. Tikk: Alternating least squares for personalized ranking. RecSys 2012
• IPTV Case Study
  D. Zibriczky, B. Hidasi, Z. Petres, D. Tikk: Personalized recommendation of linear content on interactive TV platforms: beating the cold start and noisy implicit user feedback. TVMMP @ UMAP 2012
