There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up-to-date: for example, when recommending TV programs while they are being broadcast, the model should take into account the users watching a program at that moment.
The promise of online recommendation systems is fast adaptation to change, but online machine learning from data streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in building such a recommendation engine with Apache Flink and Apache Spark.
Building Large-Scale Recommendation Engines with Apache Flink & Spark
1. Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
This project has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No 688191.
2. About us
• Institute for Computer Science and Control, Hungarian Academy of
Sciences (MTA SZTAKI)
• Informatics Laboratory
• "Big Data – Momentum" research group
• "Data Mining and Search" research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.
3. Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
1. Online
2. Batch + online
4. Solution in Spark & Flink
5. Conclusions
6. Recommendation with matrix factorization
[Figure: sparse rating matrix 𝑅; rows are users (Zoltán, Gábor), columns are items (Rogue One, Interstellar); known entries hold ratings, unknown entries are shown as 0.]
Zoltán rated Rogue One with 5 stars
7. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the rating matrix 𝑅 factorized into a user-factor matrix 𝑈 and an item-factor matrix 𝐼; each user has a user vector and each item an item vector over latent factors, e.g. level of action, level of drama, and an unnamed "X factor".]
Zoltán rated Rogue One with 5 stars
8. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the same factorization, now with the training objective.]
$$\min_{u_*,\, i_*}\; \sum_{(p,q)\in\kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p^\top i_q \right)^2 \;+\; \lambda \sum_{p\in\kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right) \;+\; \lambda \sum_{q\in\kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)$$
(μ: global mean rating; b_p, b_q: user and item biases; κ_R: the set of known ratings)
Zoltán rated Rogue One with 5 stars
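The objective above is typically minimized with stochastic gradient descent. Below is a minimal plain-Python sketch (hypothetical function names, illustrative learning rate and regularization) of the prediction rule μ + b_p + b_q + uᵀi and one SGD step per observed rating:

```python
def predict(mu, b_p, b_q, u_p, i_q):
    """Predicted rating: global mean + user bias + item bias + dot product."""
    return mu + b_p + b_q + sum(a * b for a, b in zip(u_p, i_q))

def sgd_step(r_pq, mu, b_p, b_q, u_p, i_q, lr=0.01, lam=0.1):
    """One stochastic gradient step for a single observed rating r_pq."""
    err = r_pq - predict(mu, b_p, b_q, u_p, i_q)
    b_p_new = b_p + lr * (err - lam * b_p)
    b_q_new = b_q + lr * (err - lam * b_q)
    u_new = [u + lr * (err * i - lam * u) for u, i in zip(u_p, i_q)]
    i_new = [i + lr * (err * u - lam * i) for u, i in zip(u_p, i_q)]
    return b_p_new, b_q_new, u_new, i_new

# Repeated steps shrink the error on the observed rating:
mu, b_p, b_q = 3.0, 0.0, 0.0
u_p, i_q = [0.1, 0.1, 0.1], [0.1, 0.1, 0.1]
for _ in range(200):
    b_p, b_q, u_p, i_q = sgd_step(5.0, mu, b_p, b_q, u_p, i_q)
print(abs(5.0 - predict(mu, b_p, b_q, u_p, i_q)) < 0.5)  # True
```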
9. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the same factorization, with the entry for Gábor and Interstellar marked "?".]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
11. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: as before, now highlighting Gábor's user vector (5, 4, −4) and Interstellar's item vector (3, 2, 5), which will fill in the missing entry.]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
12. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the dot product of Gábor's user vector (5, 4, −4) and Interstellar's item vector (3, 2, 5) yields the predicted rating 3.]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
13. Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: the predicted rating 3 filled into 𝑅 at the Gábor/Interstellar entry.]
Zoltán rated Rogue One with 5 stars
Would Gábor like Interstellar?
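The number on the slide can be checked directly: the predicted rating is just the dot product of the two latent vectors shown in the figure.

```python
# Gábor's user vector and Interstellar's item vector from the figure:
user_vector = [5, 4, -4]
item_vector = [3, 2, 5]

predicted_rating = sum(u * i for u, i in zip(user_vector, item_vector))
print(predicted_rating)  # 5*3 + 4*2 + (-4)*5 = 15 + 8 - 20 = 3
```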
14. Batch training
[Figure: rating events [user; item; time; rating] are collected in persistent storage and form the sparse rating matrix 𝑅.]
PERSISTENT STORAGE
16. Batch training
[Figure: a batch job reads the ratings [user; item; time; rating] from persistent storage and trains the factor matrices 𝑈 and 𝐼.]
PERSISTENT STORAGE
21. But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube has over a billion users and billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?
25. Distributed online matrix factorization
[Figure: an incoming rating event [user; item; time; rating] must be brought together with the corresponding user vector and item vector, which may live on different nodes.]
need to co-locate
26. Distributed online matrix factorization
[Figure: once co-located, the rating event is used to recompute both the user vector and the item vector.]
need to co-locate
then update
27. Distributed online matrix factorization
[Figure: the updated user and item vectors are sent back to their home partitions.]
need to co-locate
then update
send updates
28. Distributed online matrix factorization
[Figure: two rating events arrive at the same time and are processed in parallel.]
process two ratings in parallel
30. Distributed online matrix factorization
[Figure: two ratings processed in parallel may touch the same user or item vectors.]
process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
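The key idea of Distributed SGD (Gemulla et al. 2011) is a schedule that avoids concurrent modification: split users and items into k blocks each, and in every sub-epoch process only block pairs that share no user block and no item block, so their SGD updates cannot conflict. A minimal sketch of that schedule (illustrative names, not the talk's implementation):

```python
def dsgd_schedule(k):
    """Yield, per sub-epoch s, k (user_block, item_block) pairs that are
    pairwise disjoint in both coordinates (a 'stratum'); the k strata
    together cover all k*k block pairs exactly once."""
    for s in range(k):
        yield [(b, (b + s) % k) for b in range(k)]

for stratum in dsgd_schedule(3):
    # Within a stratum, no user block and no item block repeats,
    # so the k block pairs can be processed in parallel:
    assert len({u for u, _ in stratum}) == 3
    assert len({i for _, i in stratum}) == 3
    print(stratum)
```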
31. Online MF in Spark
val ratings: DStream[Rating] = ...
we have our input
32. Online MF in Spark
val ratings: DStream[Rating] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
we have our input
would like to have output like this
33. Online MF in Spark
val ratings: DStream[Rating] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
we have our input
would like to have output like this
updateStateByKey?
34. Online MF in Spark
val ratings: DStream[Rating] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
we have our input
would like to have output like this
updateStateByKey?
Use batch DSGD for online updates!
(discussion issue SPARK-6407)
35. Online MF in Spark
val ratings: DStream[Rating] = ...
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
we have our input
would like to have output like this
need to represent factor matrices
36. Online MF in Spark
val ratings: DStream[Rating] = ...
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
ratings.transform { (rs: RDD[Rating]) =>
we have our input
would like to have output like this
use transform to allow RDD operations
need to represent factor matrices
37. Online MF in Spark
val ratings: DStream[Rating] = ...
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
ratings.transform { (rs: RDD[Rating]) =>
val updates = batchDSGD(rs, users, items)
we have our input
would like to have output like this
use transform to allow RDD operations
need to represent factor matrices
compute updates
38. Online MF in Spark
val ratings: DStream[Rating] = ...
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  ratings.transform { (rs: RDD[Rating]) =>
    val updates = batchDSGD(rs, users, items)
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }
we have our input
would like to have output like this
use transform to allow RDD operations
need to represent factor matrices
compute updates
apply updates to get updated matrices
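The pattern in the Spark code above can be simulated without Spark: for each mini-batch of ratings, compute factor updates from the current model, fold them into the mutable user/item state, and emit the updates downstream, which is the role `transform` plays. A toy sketch (illustrative names; per-rating SGD stands in for the talk's `batchDSGD`):

```python
users, items = {}, {}  # stands in for the users/items RDDs

def process_minibatch(ratings, lr=0.1):
    """Compute and apply factor updates for one mini-batch; return the
    updates, which would be emitted as the output stream."""
    updates = []
    for user, item, rating in ratings:
        u = users.setdefault(user, [0.1, 0.1])
        i = items.setdefault(item, [0.1, 0.1])
        err = rating - sum(a * b for a, b in zip(u, i))
        users[user] = [a + lr * err * b for a, b in zip(u, i)]
        items[item] = [b + lr * err * a for a, b in zip(u, i)]
        updates.append(("user", user, users[user]))
        updates.append(("item", item, items[item]))
    return updates

stream = [[("Zoltan", "Rogue One", 5)], [("Gabor", "Rogue One", 4)]]
for batch in stream:
    print(len(process_minibatch(batch)))  # 2 updates per one-rating batch
```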
39. Online MF in Spark
• Performance decreases over time
40. Online MF in Spark
• Performance decreases over time
• Problem: tracking lineage graph
• Solution: use checkpointing
42. Online MF in Flink
[Figure: two long-running operators with state, one holding the user vectors and one the item vectors.]
43. Online MF in Flink
[Figure: the same operators connected by a backward edge in the dataflow (a stream loop), so updated vectors can flow back.]
44. Online MF in Flink
[Figure: step 1, a rating event arrives at the user-vector operator.]
45. Online MF in Flink
[Figure: step 2, the rating event travels on together with the matching user vector to the item-vector operator.]
46. Online MF in Flink
[Figure: the item-vector operator updates both vectors and sends the updated user vector back on the loop edge.]
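The two-hop routing on these slides can be sketched in a toy, single-process form: a rating is first routed to the partition holding the user vector, then forwarded together with that vector to the partition holding the item vector, where one SGD step updates both. All names are illustrative; in the talk's Flink job these dicts correspond to partitioned operator state.

```python
user_vectors = {}   # state of the "user vectors" operator
item_vectors = {}   # state of the "item vectors" operator

def on_rating(user, item, rating, lr=0.1):
    # hop 1: look up (or initialize) the user vector, keyed by user id
    u = user_vectors.setdefault(user, [0.1, 0.1, 0.1])
    # hop 2: the (rating, user vector) pair reaches the item partition
    i = item_vectors.setdefault(item, [0.1, 0.1, 0.1])
    err = rating - sum(a * b for a, b in zip(u, i))
    item_vectors[item] = [b + lr * err * a for a, b in zip(u, i)]
    # the updated user vector would travel back on the loop edge
    user_vectors[user] = [a + lr * err * b for a, b in zip(u, i)]

on_rating("Zoltan", "Rogue One", 5)
print(sorted(user_vectors), sorted(item_vectors))
```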
51. Combining batch + online in Spark
• Easy: can run batch training periodically on whole dataset
52. Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
• Could only do it with an external system
53. Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
• Could only do it with an external system
• Batch with Streaming API
• Feasible!
• Asynchronous training
(Schelter et al. 2014)
54. Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
• Could only do it with an external system
• Batch with Streaming API
• Feasible!
• Asynchronous training
(Schelter et al. 2014)
• Batch + online
• Both with Streaming API
• Share matrices in common state
• Parameter Server approach
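The Parameter Server approach mentioned above can be sketched minimally: both the online stream and the asynchronous batch job pull factor vectors from, and push additive updates to, a shared parameter store. The class and method names here are illustrative, not the talk's actual implementation.

```python
class ParameterServer:
    """Toy in-memory parameter store with pull/push semantics."""
    def __init__(self):
        self.params = {}

    def pull(self, key, dim=2):
        return self.params.setdefault(key, [0.0] * dim)

    def push(self, key, delta):
        """Apply an additive update; safe for asynchronous writers
        as long as updates are deltas rather than overwrites."""
        v = self.pull(key, len(delta))
        self.params[key] = [a + d for a, d in zip(v, delta)]

ps = ParameterServer()
ps.push(("user", "Gabor"), [0.5, -0.5])   # online worker update
ps.push(("user", "Gabor"), [0.1, 0.1])    # asynchronous batch update
print(ps.pull(("user", "Gabor")))          # [0.6, -0.4]
```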
57. Lessons learned (Flink vs. Spark)
• Implementation — Flink: more complex solution, harder to implement; Spark: easier to use, could use batch machinery for streaming
• Generality — Flink: can express finer-grained updates; Spark: updates limited by mini-batches
• Code stability — Flink: some parts are not mature enough (e.g. the Loops API); Spark: more mature
• Performance — Flink: optimal for online learning, can perform well on batch; Spark: not always optimal for online learning (e.g. online MF)
• Handling data skew — Flink: currently hard to relocate long-running operators; Spark: periodic scheduling enables easier modification of partitioning
• Machine learning — Flink: incomplete ML library, other ML efforts under way; Spark: MLlib is mature and used in production
62. Thank you for your attention
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
Source code:
https://github.com/gaborhermann/large-scale-recommendation
64. Batch + online combination
• 30M-listen Last.fm music dataset
• Weekly batch training
• Evaluation: weekly average, measured on every incoming listen
• Around 45,000 users
65. Online MF: Spark vs. Flink
• 30M-listen Last.fm music dataset, read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Measured: time to process X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0
66. Batch on Flink Streaming
• MovieLens 1M movie rating dataset
• Using 6 nodes, 4 cores each
Editor's Notes
Say that we focus on comparing the two systems for this use-case.
Ratings in a sparse matrix
Story: it turned out to be worth combining the two. Message: batch + online is better than batch alone or online alone. DCG: Discounted Cumulative Gain, measures ranking quality, higher is better. https://en.wikipedia.org/wiki/Discounted_cumulative_gain