Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and Spark
Zoltán Zvara
zoltan.zvara@ilab.sztaki.hu
Gábor Hermann
ghermann@ilab.sztaki.hu
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.
About us
• Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
• Informatics Laboratory
• "Big Data – Momentum" research group
• "Data Mining and Search" research group
• Research group with strong industry ties
• Ericsson, Rovio, Portugal Telecom, etc.
Agenda
1. Recommendation systems and matrix factorization
2. Batch vs. online
3. Matrix factorization
   1. Online
   2. Batch + online
4. Solution in Spark & Flink
5. Conclusions
Recommendation systems
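Recommendation with matrix factorization
[Figure: rating matrix 𝑅. Rows are users (Zoltán, Gábor), columns are items (Rogue One, Interstellar). The cells shown hold the ratings 5, 1, 3, 5, 2 and zeros.]
Zoltán rated Rogue One with 5 stars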
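Recommendation with matrix factorization
𝑈 ∙ 𝐼 ≈ 𝑅
[Figure: 𝑅 is approximated by the product of a user factor matrix 𝑈 and an item factor matrix 𝐼. Each user has a user vector and each item has an item vector. Vectors shown: (3, 2, 5), (5, 3, 2), (5, -6, -1), (5, 4, -4).]
Latent factors: level of action, level of drama, X factor
Zoltán rated Rogue One with 5 stars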
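$$
\min_{u_*,\, i_*} \sum_{(p,q) \in \kappa_R} \left( r_{pq} - \mu - b_p - b_q - u_p i_q \right)^2
+ \lambda \sum_{p \in \kappa_U} \left( \lVert u_p \rVert^2 + b_p^2 \right)
+ \lambda \sum_{q \in \kappa_I} \left( \lVert i_q \rVert^2 + b_q^2 \right)
$$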
Would Gábor like Interstellar?
[Figure: the missing cell of 𝑅 for Gábor and Interstellar is marked "?". It is filled with the dot product of the two highlighted vectors: (5, 4, -4) ∙ (3, 2, 5) = 15 + 8 - 20 = 3.]
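The prediction step above is just a dot product of two latent-factor vectors. Below is a minimal Scala sketch of it, not the authors' code: the object name `Prediction` and its functions are hypothetical, vectors are plain `scala.collection.immutable.Vector[Double]`, and the bias terms of the objective are left out, as in the slide example.

```scala
object Prediction {
  // Dot product of two latent-factor vectors of equal length.
  def dot(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same number of factors")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Predicted rating as a plain dot product (assumption: no bias terms).
  def predict(userVector: Vector[Double], itemVector: Vector[Double]): Double =
    dot(userVector, itemVector)

  def main(args: Array[String]): Unit = {
    // Example values taken from the slide: (5, 4, -4) and (3, 2, 5).
    val predicted = predict(Vector(5.0, 4.0, -4.0), Vector(3.0, 2.0, 5.0))
    println(predicted) // 15 + 8 - 20 = 3.0
  }
}
```

A low predicted value such as 3 would rank Interstellar below items with higher dot products for the same user.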
Batch training
[Figure: rating events of the form [user; item; time; rating] are collected in PERSISTENT STORAGE. The rating matrix 𝑅 is built from the stored events, and the user vectors and item vectors are then trained on it.]
Online training
[Figure: rating events of the form [user; item; time; rating] arrive as a stream (ratings shown: 2, 5, 4, 2, 4). Each incoming rating is added to 𝑅 and updates the affected user vector and item vector right away. In the example the item vector changes from (3, 2, 6) to (1, 3, 5) and the user vector from (5, -6, -2) to (4, -5, -1).]
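The slides do not spell out the update rule applied to each incoming rating. A common choice is one stochastic gradient descent (SGD) step on the regularized squared error; the sketch below assumes that choice. The names `OnlineSgd`, `update`, `learningRate` and `lambda` are hypothetical, and the bias terms are omitted for brevity.

```scala
object OnlineSgd {
  type Factors = Vector[Double]

  def dot(a: Factors, b: Factors): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One SGD step for a single rating of a user on an item.
  // With err = rating - u . i, the step is
  //   u <- u + learningRate * (err * i - lambda * u)
  //   i <- i + learningRate * (err * u - lambda * i)
  // Both new vectors are computed from the old values of u and i.
  def update(u: Factors, i: Factors, rating: Double,
             learningRate: Double, lambda: Double): (Factors, Factors) = {
    val err = rating - dot(u, i)
    val newU = u.zip(i).map { case (uk, ik) => uk + learningRate * (err * ik - lambda * uk) }
    val newI = u.zip(i).map { case (uk, ik) => ik + learningRate * (err * uk - lambda * ik) }
    (newU, newI)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative values only, not taken from the slides.
    val (u1, i1) = update(Vector(1.0, 0.0), Vector(1.0, 0.0), 2.0, 0.1, 0.0)
    println(s"user = $u1, item = $i1, prediction = ${dot(u1, i1)}")
  }
}
```

Only the two vectors touched by the rating change, which is what makes per-event updates cheap compared with retraining on the whole matrix.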
Batch + online combination
But how to scale?
• Spotify streamed 20 billion hours of music in 2015
• YouTube has over a billion users and billions of video views every day
• Use distributed data-analytics frameworks
• How can we combine batch + online?
Apache Spark vs. Apache Flink
Distributed online matrix factorization
[Figure: a rating event [user; item; time; rating] arrives from the stream. To process it, the user vector and the item vector first need to be co-located, then the update is applied, then the updates are sent back to the user vectors and item vectors.]
Process two ratings in parallel
• Concurrent modification
• Similar problem with batch SGD
• Distributed SGD (Gemulla et al. 2011)
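One way to avoid concurrent modification, in the spirit of distributed SGD, is to split users and items into the same number of blocks and only process, at the same time, blocks that share no user and no item. The sketch below uses a cyclic shift to form such groups (strata). It is an illustration under assumptions, not the authors' implementation: `DsgdStrata`, `block`, `stratum` and the modulo partitioner are all hypothetical.

```scala
object DsgdStrata {
  final case class Rating(user: Int, item: Int, value: Double)

  // Hypothetical partitioner: assign an id to a block by modulo.
  def block(id: Int, numBlocks: Int): Int = Math.floorMod(id, numBlocks)

  // Stratum s pairs user block b with item block (b + s) mod numBlocks.
  // The result maps each user block to its ratings in this stratum.
  // Two different keys never share a user block or an item block, so
  // their ratings can be processed in parallel without conflicts.
  def stratum(ratings: Seq[Rating], numBlocks: Int, s: Int): Map[Int, Seq[Rating]] =
    ratings
      .filter { r =>
        block(r.item, numBlocks) == Math.floorMod(block(r.user, numBlocks) + s, numBlocks)
      }
      .groupBy(r => block(r.user, numBlocks))

  def main(args: Array[String]): Unit = {
    val ratings = Seq(Rating(0, 0, 5.0), Rating(1, 1, 3.0), Rating(0, 1, 2.0), Rating(1, 0, 4.0))
    // Iterating s over 0 until numBlocks visits every rating exactly once.
    (0 until 2).foreach(s => println(s"stratum $s: ${stratum(ratings, 2, s)}"))
  }
}
```

Within a stratum the blocks can be handled by different workers; the strata themselves are processed one after another.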
Online MF in Spark
updateStateByKey? Use batch DSGD for online updates! (discussion issue SPARK-6407)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// we have our input
val ratings: DStream[Rating] = ...

// need to represent factor matrices
var users: RDD[(UserId, Vector)] = ...
var items: RDD[(ItemId, Vector)] = ...

// would like to have output like this
val updateStream: DStream[Either[(UserId, Vector), (ItemId, Vector)]] =
  // use transform to allow RDD operations
  ratings.transform { (rs: RDD[Rating]) =>
    // compute updates
    val updates = batchDSGD(rs, users, items)
    // apply updates to get updated matrices
    users = applyUserUpdates(users, updates)
    items = applyItemUpdates(items, updates)
    updates
  }
Online MF in Spark
• Performance decreases over time
• Problem: tracking lineage graph
• Solution: use checkpointing
Online MF in Flink
[Figure: user vectors and item vectors are kept in long-running operators with state, connected by a backward edge in the dataflow (stream loop).]
1. rating event
2. rating event & user vector
3. apply update
4. user vector update
[Example values shown: rating 2, user vector (5, -6, -2), item vector (3, 2, 6); after the update, user vector (4, -3, -1) and item vector (1, 3, 5).]
WARNING! Loops API (iterative streams) not mature enough yet, but there is ongoing effort
Online MF: Spark vs. Flink
Combining batch + online in Spark
• Easy: can run batch training periodically on whole dataset
Combining batch + online in Flink
• Combining Flink Batch API with Streaming API
  • Could only do it with an external system
• Batch with Streaming API
  • Feasible!
  • Asynchronous training (Schelter et al. 2014)
• Batch + online
  • Both with Streaming API
  • Share matrices in common state
  • Parameter Server approach
Lessons learned
Aspect | Flink | Spark
Implementation | More complex solution, harder to implement | Easier to use: could use batch for streaming
Generality | Can express finer grained updates | Updates limited by mini-batch
Code stability | Some parts are not mature enough (e.g. Loops API) | More mature
Performance | Optimal for online learning, can perform well on batch | Not always optimal for online learning (e.g. online MF)
Handling data skew | Currently hard to relocate long-running operators | Periodic scheduling enables easier modification of partitioning
Machine learning | Non-complete ML library and other efforts for ML in Flink | Spark MLlib is mature and used in production
Thank you for your attention
Zoltán Zvara, zoltan.zvara@ilab.sztaki.hu
Gábor Hermann, ghermann@ilab.sztaki.hu
Source code: https://github.com/gaborhermann/large-scale-recommendation
Measurements
Batch + online combination
• 30M music listening Last.fm dataset
• Weekly batch training
• Evaluation on every incoming listening event, averaged weekly
• Around 45,000 users
Online MF: Spark vs. Flink
• 30M music listening Last.fm dataset read from 12 Kafka partitions
• Spark batch duration: 5 sec
• Time of processing X ratings
• DSGD algorithm
• Using 6 nodes, 4 cores each
• Spark 2.1.0, Flink 1.2.0
Batch on Flink Streaming
• MovieLens 1M movie rating dataset
• Using 6 nodes, 4 cores each