Wei Tan
IBM T. J. Watson Research Center
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan
Large Scale Matrix Factorization: Systems and Acceleration
Agenda
Fei’s talk covers the formalism/theory/math of MF
My talk focuses on “how to run it fast, scalable and cost-efficient”
– Matrix factorization, SGD and ALS (10 min)
– Parallelize and accelerate SGD and ALS (20 min)
– GPU accelerated SGD and ALS (20 min)
– Conclusion and QA (10 min)
Matrix Factorization
[Figure: a rating matrix R (m users × n items, sparse observed entries *) is approximated as X · Θᵀ, where row x_uᵀ of X is user u's feature vector and column θ_v of Θᵀ is item v's feature vector, each of length f.]
MF Explained using Recommender Systems
How: factorize the rating matrix R into X · Θᵀ and minimize the empirical loss over the observed ratings, e.g. Σ_{(u,v) observed} (r_uv - x_uᵀ θ_v)², plus a regularization term (a minimal sketch follows after this slide)
Input: users' ratings on some items
Output: user/item feature vectors x_u and θ_v
Use: predict missing ratings; use the features for other tasks (e.g. clustering)
[Figure: the same factorization applies to a document × word matrix (topic model) and a word × word matrix (word embedding).]
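To make the objective above concrete, here is a minimal NumPy sketch (illustrative only, not cuMF code; the array names X and Theta, the toy sizes and the λ value are made up) that predicts one rating as x_uᵀθ_v and evaluates the regularized squared loss over the observed entries.

```python
import numpy as np

m, n, f = 5, 4, 3                     # toy sizes: users, items, feature dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((m, f))       # user features, one row x_u per user
Theta = rng.standard_normal((n, f))   # item features, one row theta_v per item

# observed ratings as (user, item, rating) triples
ratings = [(0, 1, 4.0), (0, 3, 2.0), (2, 0, 5.0), (4, 2, 3.0)]
lam = 0.05                            # regularization weight

# predict a single missing rating: r_hat_uv = x_u . theta_v
r_hat = X[0] @ Theta[2]

# empirical loss over observed entries, plus L2 regularization
loss = sum((r - X[u] @ Theta[v]) ** 2 for u, v, r in ratings)
loss += lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
print(r_hat, loss)
```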
Matrix Factorization is Key in Machine Learning and HPC Applications
– Recommender systems: link prediction to predict missing ratings
– Complex networks: vertex clustering to group similar users/items
– Web search: latent semantic models to match queries and documents
– Natural language processing: word embeddings as input to DNNs
– Deep learning: tensor decomposition for model compression and embedding layers
(Some of these are already supported in cuMF; others are to be supported.)
Challenge: MF needs to be fast, scalable, economic
Fast – recommend and update the model in a timely manner
Scalable – Facebook: 100 B ratings, 1 B users
Economic – Avoid large infrastructure
To Solve MF: SGD
Stochastic gradient descent (SGD)
- Update takes one rating at a time
- Vector inner product: memory bound
- Need many light epochs
- Parallelize: non-trivial
- Handle dense (implicit) ratings: no
(see the sketch below)
[Figure: SGD sweeps the ratings one at a time; each update touches one x_u and one θ_v.]
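A minimal NumPy sketch of one sequential SGD epoch for this objective (illustrative only; the learning rate, regularization weight and toy data are assumptions, not cuMF_SGD's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, f = 100, 80, 16
X = 0.1 * rng.standard_normal((m, f))
Theta = 0.1 * rng.standard_normal((n, f))
ratings = [(int(rng.integers(m)), int(rng.integers(n)), float(rng.integers(1, 6)))
           for _ in range(2000)]
lr, lam = 0.01, 0.05

def sgd_epoch(X, Theta, ratings, lr, lam):
    # one rating at a time: a cheap inner product plus two vector updates,
    # hence memory-bound and inherently sequential
    for u, v, r in ratings:
        err = r - X[u] @ Theta[v]
        xu = X[u].copy()                       # keep the old x_u for the theta update
        X[u] += lr * (err * Theta[v] - lam * X[u])
        Theta[v] += lr * (err * xu - lam * Theta[v])

for _ in range(10):
    sgd_epoch(X, Theta, ratings, lr, lam)
```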
To Solve MF: ALS
Alternating Least Squares (ALS)
- Update takes ALL ratings at a time
- Vector outer product & solve: compute bound
- Need few heavy epochs
- Parallelize: straightforward
- Handle dense (implicit) ratings: yes
(see the sketch below)
[Figure: ALS solves each x_u using all of the θ_v that user u rated.]
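A minimal NumPy sketch of one ALS half-iteration: for each user, gather the θ_v of the items that user rated, form the f×f normal equations, and solve them exactly (the item-side update is symmetric). All names and sizes are illustrative, not cuMF_ALS code:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, f, lam = 100, 80, 16, 0.05
Theta = 0.1 * rng.standard_normal((n, f))
X = np.zeros((m, f))
# rated[u] = (item indices, ratings) for user u; toy random data
rated = {u: (rng.choice(n, size=5, replace=False),
             rng.integers(1, 6, size=5).astype(float)) for u in range(m)}

def als_update_users(X, Theta, rated, lam):
    f = Theta.shape[1]
    for u, (items, r_u) in rated.items():
        Tu = Theta[items]                    # |Omega_u| x f: gather the rated items' features
        A = Tu.T @ Tu + lam * np.eye(f)      # f x f Gramian (sum of outer products)
        b = Tu.T @ r_u
        X[u] = np.linalg.solve(A, b)         # exact (LU/Cholesky-style) solve

als_update_users(X, Theta, rated, lam)       # then update Theta the same way
```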
To Solve MF: CD
Coordinate descent (CD)
- Similar to ALS
- But updates one coordinate of x_u and θ_v at a time (see the sketch below)
[Figure: CD cycles through the f coordinates of each feature vector in turn.]
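A minimal NumPy sketch of the coordinate-descent idea: update one coordinate of x_u at a time in closed form while holding everything else fixed (the θ_v side is symmetric). This is an illustrative version, not cuMF's CCD implementation:

```python
import numpy as np

def cd_update_user(x_u, Theta_u, r_u, lam):
    """Update each coordinate of x_u in turn, given the rated items' feature
    matrix Theta_u (|Omega_u| x f) and the ratings r_u."""
    err = r_u - Theta_u @ x_u                  # residuals over the rated items
    for k in range(x_u.shape[0]):
        tk = Theta_u[:, k]
        # closed-form 1-D least squares for coordinate k, all others held fixed
        new_xk = tk @ (err + x_u[k] * tk) / (lam + tk @ tk)
        err += (x_u[k] - new_xk) * tk          # keep the residuals consistent
        x_u[k] = new_xk
    return x_u

# toy usage
rng = np.random.default_rng(0)
f, nnz = 8, 6
x_u = np.zeros(f)
Theta_u = rng.standard_normal((nnz, f))
r_u = rng.integers(1, 6, size=nnz).astype(float)
cd_update_user(x_u, Theta_u, r_u, lam=0.05)
```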
To Parallelize ALS
Alternating Least Squares (ALS)
- Solve the x_u's independently (and then the θ_v's)
- Parallelize the solves of the x_u's across multiple nodes:
  - Replicate Θ
  - Partially replicate Θ
  - Split Θ across multiple nodes
[Figure: each worker solves a subset of the x_u, given a full or partial copy of Θ.]
Parallelize SGD: Hogwild!
[Figure: worker 1, worker 2 and worker 3 each update an (x_u, θ_v) pair; two workers touch the same θ_v at the same time -- an update conflict!]
Hogwild! [Niu et al. 2011] -- parallel SGD converges despite (occasional) update conflicts (see the sketch below)
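A minimal Hogwild!-style sketch using Python threads (illustrative only; real Hogwild! targets multi-core CPUs, and cuMF_SGD's GPU variant differs). The shared X and Theta arrays are updated without any locks, so occasional conflicting updates are simply tolerated:

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
m, n, f = 200, 150, 16
X = 0.1 * rng.standard_normal((m, f))
Theta = 0.1 * rng.standard_normal((n, f))
ratings = [(int(rng.integers(m)), int(rng.integers(n)), float(rng.integers(1, 6)))
           for _ in range(20000)]
lr, lam = 0.005, 0.05

def hogwild_worker(seed, updates):
    rng = np.random.default_rng(seed)
    for _ in range(updates):
        u, v, r = ratings[rng.integers(len(ratings))]   # random sample -> poor locality
        err = r - X[u] @ Theta[v]
        xu = X[u].copy()
        X[u] += lr * (err * Theta[v] - lam * X[u])       # no locks: conflicts are possible
        Theta[v] += lr * (err * xu - lam * Theta[v])

threads = [Thread(target=hogwild_worker, args=(s, 5000)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```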
Hogwild! Not Good Enough?
Random sampling hurts cache performance (on GPUs) -- the hardware cannot prefetch
[Figure: randomly sampled ratings touch scattered x_u and θ_v vectors, so caches and prefetchers are ineffective.]
Parallelize SGD: Matrix blocking
[Figure: R divided into 4×4 blocks; the blocks are processed in waves (0. divide into blocks, then wave 1, wave 2, wave 3, ...) such that blocks processed concurrently share no rows or columns.]
Divide R into blocks, say 4*4
4 workers update 4 “non-overlapping” blocks concurrently
– Workers do not need to communicate
Parallelize SGD: Matrix blocking
[Figure: with 4×4 blocks, wave 1 and wave 2 each process 4 non-overlapping blocks.]
Cons: all 4 workers need to complete before the next wave starts
Solution: more blocks than workers
– e.g. 6*6 blocks, 4 workers
– Worker 0 can immediately pick up another block when T0 is done (see the scheduling sketch below)
Cons: scheduling overhead
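A minimal sketch of the scheduling idea above: split R into more blocks than workers, and let a free worker grab any unprocessed block whose block-row and block-column are not currently in use (illustrative Python, not cuMF_SGD's scheduler):

```python
import threading

NB = 6                       # 6x6 blocks: more blocks than workers
WORKERS = 4
lock = threading.Lock()
busy_rows, busy_cols = set(), set()
todo = {(i, j) for i in range(NB) for j in range(NB)}    # blocks left in this epoch

def acquire_block():
    """Pick any remaining block whose row and column are free; None if none is."""
    with lock:
        for (i, j) in todo:
            if i not in busy_rows and j not in busy_cols:
                todo.remove((i, j)); busy_rows.add(i); busy_cols.add(j)
                return (i, j)
    return None

def release_block(i, j):
    with lock:
        busy_rows.discard(i); busy_cols.discard(j)

def worker():
    while todo:
        blk = acquire_block()
        if blk is None:
            continue                      # wait until a conflict-free block frees up
        i, j = blk
        # ... run SGD on the ratings inside block (i, j) here ...
        release_block(i, j)

threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
```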
MF methods with SGD, ALS and CCD
Challenge: compute and memory capacity of CPU
A CPU offers roughly 1 Tflops of compute and 80 GB/s of memory bandwidth. With f = 100, per epoch:
• ALS floating-point operations
  - Netflix: 1.5 T
  - Hugewiki: 80 T
  - Facebook: 2000 T
• SGD memory transfer
  - Netflix: 80 GB
  - Hugewiki: 2.4 TB
  - Facebook: 80 TB
These are far larger than a CPU's flops and bandwidth capacity (a rough back-of-envelope estimate follows).
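One plausible back-of-envelope calculation behind these numbers (assumptions: f = 100, roughly 100 M ratings for Netflix, ~3 B for Hugewiki and the 100 B quoted for Facebook; ALS cost approximated by the Nz·f² Gramian updates, SGD traffic by reading two f-dimensional FP32 vectors per rating). It reproduces the orders of magnitude on the slide; the slide's exact ALS counts also include the per-user/per-item solves:

```python
f = 100
datasets = {"Netflix": 100e6, "Hugewiki": 3e9, "Facebook": 100e9}  # approx. #ratings Nz

for name, nz in datasets.items():
    als_flops = nz * f * f                 # rank-1 Gramian updates dominate ALS
    sgd_bytes = nz * 2 * f * 4             # read x_u and theta_v (FP32) per rating
    print(f"{name:9s}  ALS ~{als_flops/1e12:7.1f} Tflops/epoch   "
          f"SGD ~{sgd_bytes/1e12:6.2f} TB/epoch")

# At ~1 Tflops and ~80 GB/s, an epoch on the largest data set (Facebook) takes
# on the order of a thousand seconds on one CPU -- hence the need for acceleration.
```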
GPU vs. CPU: compute FLOPS and memory bandwidth
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
• Raw performance: 1 GPU ≈ 10 × CPU
• Practical performance (distributed CPU solutions suffer from slow interconnects): 1 GPU > 10 × CPU, 4 GPUs >> 40 × CPU
Goal: a CUDA library for MF
[Figure: software stack -- applications such as collaborative filtering and word embedding sit on top of cuMF, whose ALS and SGD kernels run on CUDA GPUs.]
Fast: fast training, update the model quickly
Scalable: deal with big data, exploit fast interconnects
Cost efficient: fully utilize flops and bandwidth, cheaper than CPU solutions
Challenges of ALS
• ALS needs to solve many linear systems, one per user (and, in turn, one per item): (Σ_{v∈Ω_u} θ_v θ_vᵀ + λI) x_u = Σ_{v∈Ω_u} r_uv θ_v
• Challenge 1: access and aggregate many θ_v's: memory-irregular and compute-intensive
• Challenge 2: the LU or Cholesky solver is compute-intensive
• Challenge 3: a single GPU can NOT handle big m, n and Nz
Challenge 1: improve flops
Nvidia Pascal: Memory BW: 740 GB/sec, compute: 11 Tflops
Higher flops require higher operational intensity (more flops per byte), which requires caching!
[Roofline plot: flops vs. operational intensity (flops/byte); the bandwidth roof is 740 GB/s, the compute roof is 11 Tflops, and the ridge point is at about 11 T / 740 G ≈ 15 flops/byte -- below it the GPU is under-utilized, above it fully utilized.]
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009)
Address challenge 1: memory-optimized ALS
• To obtain the per-user Gramian Σ_{v∈Ω_u} θ_v θ_vᵀ:
  1. non-coalesced read of the θ_v's from global memory
  2. stage them into shared memory (smem)
  3. tile and aggregate the outer products in registers
Address Challenge 2: exact solver is compute intensive
[Figure: baseline -- every ALS iteration (1, 2, ..., n) exactly solves X and then Θ, at O(f³) per system; alternative -- every ALS iteration only approximately solves X and Θ, using f_s << f inner iterations of O(f²) each.]
Address Challenge 2: use CG solver
Solver time: CG ≈ ¼ that of LU
The CG solver is memory- (instead of compute-) bound
CG with FP16 takes ≈ ½ the time of CG with FP32
(see the CG sketch below)
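A minimal NumPy sketch of the conjugate-gradient (CG) idea: approximately solve each per-user system A x_u = b, with A = Θ_Ωuᵀ Θ_Ωu + λI, using only matrix-vector products and a small fixed number of iterations, instead of an O(f³) exact factorization. This is an illustrative FP32 version, not cuMF's FP16 CUDA solver:

```python
import numpy as np

def cg_solve(A, b, iters=6, x0=None):
    """Approximately solve A x = b for symmetric positive-definite A."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p                         # the only O(f^2) operation per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# toy usage on one ALS system: A = Theta_u^T Theta_u + lam*I, b = Theta_u^T r_u
rng = np.random.default_rng(0)
f, nnz, lam = 100, 50, 0.05
Theta_u = rng.standard_normal((nnz, f))
r_u = rng.integers(1, 6, size=nnz).astype(float)
A = Theta_u.T @ Theta_u + lam * np.eye(f)
b = Theta_u.T @ r_u
x_u = cg_solve(A, b, iters=10)
```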
Address Challenge 3: scale-up ALS on multiple GPUs
Model parallel: each GPU solves a portion of the model
[Figure: model-parallel partitioning of X and Θ across GPUs.]
Address Challenge 3: scale-up ALS on multiple GPUs
Data parallel: each GPU solves with a portion of the training data
[Figure: data-parallel partitioning of the ratings, combined with model parallelism, across GPUs.]
Recap: challenges of ALS
• ALS needs to solve many linear systems, one per user (and, in turn, one per item): (Σ_{v∈Ω_u} θ_v θ_vᵀ + λI) x_u = Σ_{v∈Ω_u} r_uv θ_v
• Challenge 1: access and aggregate many θ_v's (memory-irregular and compute-intensive) -- use registers, shared memory and non-coalesced reads
• Challenge 2: the LU or Cholesky solver is compute-intensive -- use an approximate CG solver and FP16
• Challenge 3: a single GPU can NOT handle big m, n and Nz -- use model and data parallelism, and topology-aware reduction
Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
Modified mllib/ALS detects GPUs and offloads matrix computation
Leverage the best of Spark (scale-out) and GPU (scale-up)
[Figure: stack -- ALS applications call mllib/ALS, which calls cuMF through JNI.]
https://github.com/IBMSparkGPU/CUDA-MLlib http://www-01.ibm.com/support/docview.wss?uid=swg21983421
Connect cuMF to Spark MLlib
[Figure: two Power 8 nodes, each with 2 K40 GPUs; RDD partitions on the CPUs feed CUDA kernels on GPU1 and GPU2, with a shuffle stage between nodes.]
RDD on CPU: to distribute rating data and shuffle parameters
Solver on GPU: to form and solve the least-squares systems
Able to run on multiple nodes, and multiple GPUs per node
Challenges of SGD
• Iterate over all ratings and, for each rating r_uv, apply in sequence: e_uv = r_uv - x_uᵀθ_v; x_u ← x_u + α(e_uv θ_v - λ x_u); θ_v ← θ_v + α(e_uv x_u - λ θ_v)
• Memory bound
Two design questions:
1. the update kernel -- exploit caching, memory coalescing, half precision, warp shuffle, ILP and register reuse
2. how to parallelize -- (a) Hogwild or (b) matrix blocking
Experiment 1: is cuMF fast and scalable?
• cuMF_ALS w/ FP16 on Maxwell and Pascal: 1 GPU for Netflix and Yahoo
• LIBMF: 1 CPU w/ 40 threads
• NOMAD: 32 nodes for Netflix and Yahoo, 64 HPC nodes for Hugewiki
• cuMF is 2-10x as fast as these baselines
Experiment 2: cuMF_ALS and cuMF_SGD (on Maxwell)
• ALS is slightly slower than SGD on a single GPU
• On the big data set Hugewiki, ALS@4 GPUs performs best -- SGD is harder to parallelize across multiple GPUs!
Experiment 3: is cuMF cost efficient?
• cuMF_ALS @4 Maxwell GPUs takes ≈ 1/10 the time of SparkALS @50 nodes
• At ≈ $2.5/hr, it costs ≈ 1/10 as much per hour as the 50 nodes
• Overall ≈ 1% of SparkALS's cost
Conclusion
Why accelerate matrix factorization using GPU?
– MF needs to be fast, scalable, and economic
– GPU offers ~10x flops, memory BW, and fast interconnect
How does cuMF tackle the challenges?
– Optimize memory access, parallelism and communication
– Approximate computing
– Reduced precision
What is the result?
– Implemented ALS and SGD
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone, with Spark or Tensorflow
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016
CuMF_SGD: Fast and Scalable Matrix Factorization. CoRR abs/1610.05838, 2016
Code: http://github.com/cuMF/
Blog: http://ibm.biz/cumf-blog
Contact: Wei Tan, [email protected]