How to scale recommendations to extremely large scale using Apache Flink. We use matrix factorization to calculate a latent factor model which can be used for collaborative filtering. The implemented alternating least squares algorithm is able to deal with data sizes on the scale of Netflix.
5. Collaborative Filtering
§ Recommend items based on users with
similar preferences
§ Latent factor models capture underlying
characteristics of items and preferences
of user
§ Predicted preference:
4
ˆru,i = xu
T
yi
7. Matrix Factorization
§ Calculate low rank approximation to obtain latent factors
6
minX,Y ru,i − xu
T
yi( )
2
+ λ nu xu
2
+ ni yi
2
i
∑
u
∑
#
$
%
&
'
(
ru,i≠0
∑
Items
Users
10 5 ?
? 2 10
7 ? 5
!
"
#
#
#
$
%
&
&
&
≈ X
!
"
#
#
#
$
%
&
&
&
• Y( )
Princess
Facebook
§ = rating of user u for item i
§ = latent factors of user u
§ = latent factors of item i
§ = regularization constant
§ = number of rated items for user u
§ = number of ratings for item i
xu
yi
λ
nu
ni
ru,i
8. Alternating Least Squares
§ Hard to optimize since we have two
variables
§ Fixing one variables gives quadratic
problem
7
X,Y
xu = YSu
YT
+ λnuΙ( )
−1
Yru
T
Sii
u
=
1 if ru,i ≠ 0
0 else
"
#
$
%$
We
only
need
the
item
vectors
rated
by
user
u
18. Iterate by looping
§ for/while loop in client submits one job per
iteration step
§ Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
17
23. Naïve Implementation
1. Join item vectors
with ratings
2. Group on user ID
3. Compute new
user vectors
22
xu = YSu
YT
+ λnuΙ( )
−1
Yru
T
24. Pros and Cons of Naïve ALS
§ Pros
• Easy to implement
§ Cons
• Item vectors are sent redundantly to network
nodes
• Two shuffle steps make execution expensive
23
25. Blocked ALS Implementation
1. Create user and item
rating blocks
2. Cache them on
worker nodes
3. Send all item vectors
needed by user
rating block
bundled
4. Compute block of
item vectors
24
Based
on
Spark’s
MLlib
implementaBon
26. Pros and Cons of Blocked ALS
§ Pros
• Reduces network load by avoiding data
duplication
• Caching ratings: Only one shuffle step needed
§ Cons
• Duplicates the rating matrix (user block/item
block partitioning)
25
27. Performance Comparison
26
• 40
node
GCE
cluster,
highmem-‐8
• 10
ALS
iteraBons
with
50
latent
factors
• RaBng
matrix
has
28
billion
non
zero
entries:
Scale
of
NeAlix
or
SpoCfy
28. Machine Learning with FlinkML
§ FlinkML contains
blocked ALS
§ Support for many
other tasks
• Clustering
• Regression
• Classification
§ scikit-learn like
pipeline support
27
val
als
=
ALS()
val
ratingDS
=
env.readCsvFile[(Int,
Int,
Double)](ratingData)
val
parameters
=
ParameterMap()
.add(ALS.Iterations,
10)
.add(ALS.NumFactors,
50)
.add(ALS.Lambda,
1.5)
als.fit(ratingDS,
parameters)
val
testingDS
=
env.readCsvFile[(Int,
Int)](testingData)
val
predictions
=
als.predict(testingDS)
30. What Have You Seen?
§ How to use collaborative filtering to make
recommendations
§ Apache Flink, a powerful parallel stream
processing engine
§ How to use Apache Flink and alternating least
squares to factorize really large matrices
29