This document provides an overview of recommendation systems and collaborative filtering techniques. It discusses using matrix factorization to predict user ratings by representing users and items as vectors in a latent factor space. Optimization techniques like stochastic gradient descent can be used to learn the factorization from existing ratings. The document also notes challenges of sparsity and scale for practical systems and describes approaches like elastic net regularization and sparsification to address these.
Recommendation System — Theory and Practice
1. Recommendation System — Theory and Practice
IMI Colloquium @ Kyushu Univ.
February 18, 2015
Kimikazu Kato
Silver Egg Technology
1 / 27
2. About myself
Kimikazu Kato
Ph.D. in computer science, Master's degree in mathematics
Experience in numerical computation, especially ...
Geometric computation, computer graphics
Partial differential equation, parallel computation, GPGPU
Now specialize in
Machine learning, especially, recommendation system
3. About our Company
Silver Egg Technology
Established: 1998
CEO: Tom Foley
Main Service: Recommendation System, Online Advertisement
Major Clients: QVC, Senshukai (Bell Maison), Tsutaya
We provide a recommendation system to Japan's leading web sites.
4. Today's Story
Introduction to recommendation systems
Rating prediction
Shopping behavior prediction
Practical viewpoint
Conclusion
5. Recommendation System
Recommender systems or recommendation systems (sometimes
replacing "system" with a synonym such as platform or engine) are a
subclass of information filtering system that seek to predict the
'rating' or 'preference' that a user would give to an item. — Wikipedia
In this talk, we focus on collaborative filtering methods, which utilize only
users' behavior, activity, and preferences.
Other methods include:
Content-based methods
Methods using demographic data
Hybrid methods
6. Our Service and Mechanism
ASP service named "Aigent Recommender"
Works as an add-on to an existing web site.
7. Netflix Prize
The Netflix Prize was an open competition for the best collaborative
filtering algorithm to predict user ratings for films, based on previous
ratings without any other information about the users or films, i.e.
without the users or the films being identified except by numbers
assigned for the contest. — Wikipedia
In short, an open competition for preference prediction.
Closed in 2009.
8. Description of the Problem
user\movie W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?
Given rating information for some user/movie pairs,
is it possible to predict a rating for an unknown user/movie pair?
9. Notations
Number of users: n
Set of users: U = {1, 2, …, n}
Number of items (movies): m
Set of items (movies): I = {1, 2, …, m}
Input matrix: A (an n × m matrix)
10. Matrix Factorization
Based on the assumption that each item is described by a small number of
latent factors
Each rating is expressed as a linear combination of the latent factors
Achieved good performance in the Netflix Prize
Find matrices X ∈ Mat(f, n) and Y ∈ Mat(f, m), where f ≪ n, m, such that
A ≈ X^T Y
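As a numerical sketch of this factorization (not the Netflix Prize implementation), a truncated SVD gives the best rank-f approximation in Frobenius norm. Note that this naive version treats unknown entries (zeros) as observed ratings, which later slides refine:

```python
import numpy as np

# Toy rating matrix (rows: users, columns: movies); 0 = unknown.
A = np.array([[5., 4., 1., 4.],
              [0., 4., 0., 0.],
              [0., 2., 3., 0.],
              [1., 0., 4., 0.]])

f = 2  # number of latent factors, f << n, m

# Truncated SVD yields the best rank-f approximation in Frobenius norm.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = np.sqrt(s[:f])[:, None] * U[:, :f].T   # f x n user-factor matrix
Y = np.sqrt(s[:f])[:, None] * Vt[:f, :]    # f x m item-factor matrix

A_hat = X.T @ Y  # rank-f reconstruction, A ≈ X^T Y
```

Each entry of A_hat is the inner product X_u^T Y_i of a user vector and an item vector.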
11. Find X and Y that maximize p(X, Y | A, σ), where
p(A | X, Y, σ) = ∏_{A_ui ≠ 0} N(A_ui | X_u^T Y_i, σ)
p(X | σ_X) = ∏_u N(X_u | 0, σ_X I)
p(Y | σ_Y) = ∏_i N(Y_i | 0, σ_Y I)
(N( · | μ, Σ) denotes the normal density.)
12. According to Bayes' theorem,
p(X, Y | A, σ) = p(A | X, Y, σ) p(X | σ_X) p(Y | σ_Y) × const.
Thus,
log p(X, Y | A, σ, σ_X, σ_Y)
= −∑_{A_ui ≠ 0} (A_ui − X_u^T Y_i)² − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro + const.
where ∥ ⋅ ∥_Fro means the Frobenius norm.
How can this be computed? Use MCMC. See [Salakhutdinov et al., 2008].
Once X and Y are determined, Ã := X^T Y, and the prediction for A_ui is
estimated by Ã_ui.
13. Difference between Rating and Shopping
Rating
user\movie W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?
Includes negative feedback ("1" means "boring")
Zero means "unknown"
Shopping (Browsing)
user\item W X Y Z
A 1 1 1 1
B 1
C 1
D 1 1 ?
Includes no negative feedback
Zero means "unknown" or "negative"
More degrees of freedom
Consequently, the algorithm effective for the rating matrix is not necessarily
effective for the shopping matrix.
14. Evaluation Metrics for Recommendation Systems
Rating prediction
Root Mean Squared Error (RMSE)
The square root of the mean of the squared errors
Shopping prediction
Precision
(# of Recommended and Purchased)/(# of Recommended)
Recall
(# of Recommended and Purchased)/(# of Purchased)
The criteria are different. This is another reason different algorithms should
be applied.
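These criteria are simple to state in code; a minimal sketch (function names are mine, not a library API):

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error over known ratings."""
    errs = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errs) / len(errs))

def precision_recall(recommended, purchased):
    """Precision and recall of a recommended set against actual purchases."""
    hits = len(set(recommended) & set(purchased))
    return hits / len(recommended), hits / len(purchased)

print(rmse([4.0, 3.0], [5.0, 3.0]))              # sqrt(1/2) ≈ 0.707
print(precision_recall(["W", "X"], ["X", "Y"]))  # (0.5, 0.5)
```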
16. Adding a Constraint
The problem is too many degrees of freedom
A desirable characteristic is that many elements of the product should be
zero
Assume that a certain ratio of the zero elements of the input matrix remains
zero after the optimization [Sindhwani et al., 2010]
Experimentally outperforms the "zero-as-negative" method
17. [Sindhwani et al., 2010]
Introduced variables p_ui to relax the problem.
Minimize
∑_{A_ui ≠ 0} (A_ui − X_u^T Y_i)² + λ_X ∥X∥²_Fro + λ_Y ∥Y∥²_Fro
+ ∑_{A_ui = 0} [ p_ui (0 − X_u^T Y_i)² + (1 − p_ui)(1 − X_u^T Y_i)² ]
+ T ∑_{A_ui = 0} [ −p_ui log p_ui − (1 − p_ui) log(1 − p_ui) ]
subject to
(1 / |{A_ui | A_ui = 0}|) ∑_{A_ui = 0} p_ui = r
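The relaxed objective can be transcribed directly. The following is a sketch that only evaluates it for given factors X, Y and probabilities p_ui (names and shapes are my own convention, not the paper's code):

```python
import numpy as np

def low_density_objective(A, X, Y, p, lam_x, lam_y, T):
    """Evaluate the relaxed objective of [Sindhwani et al., 2010] (sketch).

    A: input matrix with 0 = unobserved; X: f x n user factors;
    Y: f x m item factors; p: matrix of probabilities p_ui (used at zeros).
    """
    P = X.T @ Y                      # predicted scores X_u^T Y_i
    obs = A != 0
    z = ~obs
    loss = np.sum((A[obs] - P[obs]) ** 2)          # fit on observed entries
    loss += lam_x * np.sum(X ** 2) + lam_y * np.sum(Y ** 2)  # Frobenius terms
    # Zero entries: p_ui-weighted mix of target 0 ("negative") and target 1.
    loss += np.sum(p[z] * (0 - P[z]) ** 2 + (1 - p[z]) * (1 - P[z]) ** 2)
    # Entropy term with temperature T keeps the p_ui away from degenerate values.
    q = np.clip(p[z], 1e-12, 1 - 1e-12)
    loss += T * np.sum(-q * np.log(q) - (1 - q) * np.log(1 - q))
    return loss

# Tiny smoke usage with all-zero factors and p_ui = 0.5 everywhere.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
X0 = np.zeros((2, 2)); Y0 = np.zeros((2, 2))
p = np.full((2, 2), 0.5)
val = low_density_objective(A, X0, Y0, p, 0.1, 0.1, 1.0)
```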
18. Ranking prediction
Another strategy for shopping prediction
A "learn from the order" approach
Predict whether X is more likely to be bought than Y, rather than the
probability of X or Y being bought.
19. Bayesian Personalized Ranking
[Rendle et al., 2009]
Consider a matrix factorization model, but update the elements
according to the observation of the "orders"
The parameters are the same as in usual matrix factorization, but the
objective function is different
Consider a total order >_u for each u ∈ U. Suppose that i >_u j (i, j ∈ I) means
"the user u is more likely to buy i than j."
The objective is to calculate p(i >_u j) for pairs such that A_ui = 0 and A_uj = 0
(which means i and j are not bought by u).
20. Let
D_A = {(u, i, j) ∈ U × I × I | A_ui = 1, A_uj = 0},
and define
∏_{u ∈ U} p(>_u | X, Y) := ∏_{(u,i,j) ∈ D_A} p(i >_u j | X, Y),
where we assume
p(i >_u j | X, Y) = σ(X_u^T Y_i − X_u^T Y_j), with σ(x) = 1 / (1 + e^(−x)).
According to Bayes' theorem, the function to be optimized becomes:
∏_u p(X, Y | >_u) = ∏_u p(>_u | X, Y) × p(X) p(Y) × const.
21. Taking the log of this,
L := log [ ∏_u p(>_u | X, Y) × p(X) p(Y) ]
= ∑_{(u,i,j) ∈ D_A} log p(i >_u j | X, Y) − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro
= ∑_{(u,i,j) ∈ D_A} log σ(X_u^T Y_i − X_u^T Y_j) − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro
Now consider the following problem:
max_{X,Y} [ ∑_{(u,i,j) ∈ D_A} log σ(X_u^T Y_i − X_u^T Y_j) − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro ]
This means "find a pair of matrices X, Y which preserve the order of the
elements of the input matrix for each u."
22. Computation
The function we want to optimize:
∑_{(u,i,j) ∈ D_A} log σ(X_u^T Y_i − X_u^T Y_j) − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro
U × I × I is huge, so in practice, a stochastic method is necessary.
Let the parameters be Θ = (X, Y).
The algorithm is the following:
Repeat the following:
Choose (u, i, j) ∈ D_A randomly
Update Θ with
Θ ← Θ + α ∂/∂Θ ( log σ(X_u^T Y_i − X_u^T Y_j) − λ_X ∥X∥²_Fro − λ_Y ∥Y∥²_Fro )
This method is called Stochastic Gradient Descent (SGD) (here an ascent,
since the objective is maximized).
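A minimal self-contained sketch of this loop on toy data (hyperparameters α and λ are hypothetical; the update is written in gradient-ascent form because the objective is maximized):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit-feedback matrix: 1 = bought, 0 = unknown.
A = np.array([[1, 1, 1, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 1, 0]])
n, m = A.shape
f, alpha, lam = 2, 0.05, 0.01  # latent factors, step size, regularization

X = 0.1 * rng.standard_normal((f, n))  # user factors, f x n
Y = 0.1 * rng.standard_normal((f, m))  # item factors, f x m

# D_A: triples (u, i, j) with A[u, i] = 1 (bought) and A[u, j] = 0 (unknown).
D = [(u, i, j) for u in range(n) for i in range(m) for j in range(m)
     if A[u, i] == 1 and A[u, j] == 0]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    u, i, j = D[rng.integers(len(D))]   # choose (u, i, j) randomly
    xu, yi, yj = X[:, u].copy(), Y[:, i].copy(), Y[:, j].copy()
    g = 1.0 - sigmoid(xu @ (yi - yj))   # derivative of log sigma at X_u^T(Y_i - Y_j)
    # Gradient-ascent step (the factor 2 of the L2 term is folded into lam).
    X[:, u] += alpha * (g * (yi - yj) - lam * xu)
    Y[:, i] += alpha * (g * xu - lam * yi)
    Y[:, j] += alpha * (-g * xu - lam * yj)

scores = X.T @ Y  # bought items should now tend to score above unknown ones
```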
23. Practical Aspect of Recommendation
Problem
Computational time
Memory consumption
How many services can be integrated in a server rack?
Super high accuracy with a supercomputer is useless for real business
24. Sparsification
Representing a big matrix as a sparse matrix saves computational
time and memory consumption at the same time
It is advantageous to employ a model whose parameters become sparse
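A toy illustration of the memory point: store only the nonzero entries (a dict-of-keys sketch, not a production sparse format):

```python
# A 1000 x 1000 matrix with 0.1% nonzeros, stored densely vs. sparsely.
n, m = 1000, 1000
dense = [[0.0] * m for _ in range(n)]
sparse = {}  # dict-of-keys: (row, col) -> value, nonzeros only

for k in range(1000):          # 1000 nonzero entries at scattered positions
    i, j = (7 * k) % n, (13 * k) % m
    dense[i][j] = 1.0
    sparse[(i, j)] = 1.0

print(len(sparse))             # 1000 stored values
print(n * m)                   # 1000000 dense slots
```

Iterating only over the stored entries gives the corresponding saving in computation time.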
25. Example of sparse model: Elastic Net
In the regression model, adding an L1 term makes the solution sparse
[Zou and Hastie, 2005]:
min_w [ (1/2n) ∥Xw − y∥²_2 + (λ(1−ρ)/2) ∥w∥²_2 + λρ |w|_1 ]
A similar idea is used for matrix factorization [Ning et al., 2011]:
Minimize
∥A − AW∥²_Fro + (λ(1−ρ)/2) ∥W∥²_Fro + λρ |W|_1
subject to
diag W = 0
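The sparsity effect of the L1 term can be seen in a small sketch: a minimal proximal-gradient (ISTA-style) solver for the regression form above, with hypothetical λ and ρ (this is not the solver used by the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with a sparse true weight vector.
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(n)

lam, rho = 0.5, 0.9   # overall strength and L1/L2 mix (hypothetical values)
# Step size = 1 / Lipschitz constant of the smooth part of the objective.
L = np.linalg.norm(X, 2) ** 2 / n + lam * (1 - rho)
step = 1.0 / L

def soft_threshold(v, t):
    """Proximal operator of t * |v|_1: shrinks values toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(d)
for _ in range(500):
    # Gradient of the smooth part: squared loss plus the L2 (ridge) term.
    grad = X.T @ (X @ w - y) / n + lam * (1 - rho) * w
    w = soft_threshold(w - step * grad, step * lam * rho)

print(np.count_nonzero(w))  # far fewer than d nonzero coefficients
```

The soft-thresholding step is exactly where the L1 term zeroes out small coefficients.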
26. Conclusion: What is Important for Good Prediction?
Theory
Machine learning
Mathematical optimization
Implementation
Algorithms
Computer architecture
Mathematics
Human factors!
Hand tuning of parameters
Domain specific knowledge
27. References
Salakhutdinov, Ruslan, and Andriy Mnih. "Bayesian probabilistic matrix
factorization using Markov chain Monte Carlo." Proceedings of the 25th
international conference on Machine learning. ACM, 2008.
Sindhwani, Vikas, et al. "One-class matrix completion with low-density
factorizations." Data Mining (ICDM), 2010 IEEE 10th International
Conference on. IEEE, 2010.
Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit
feedback." Proceedings of the Twenty-Fifth Conference on Uncertainty in
Artificial Intelligence. AUAI Press, 2009.
Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the
elastic net." Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 67.2 (2005): 301-320.
Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-n
recommender systems." Data Mining (ICDM), 2011 IEEE 11th
International Conference on. IEEE, 2011.