Large Scale Matrix Factorization: Systems and Acceleration

Wei Tan, IBM T. J. Watson Research Center
[email protected] · http://github.com/cumf · http://researcher.ibm.com/person/us-wtan

(Tutorial at IEEE BigData 2016; original slides: bigdataieee.org/BigData2016/files/Tutorial3-2.pdf)


Page 1:

Wei Tan

IBM T. J. Watson Research Center

[email protected]

http://github.com/cumf

http://researcher.ibm.com/person/us-wtan

Large Scale Matrix Factorization: Systems and Acceleration

Page 2:

Agenda

Fei’s talk covers the formalism/theory/math of MF

My talk focuses on “how to run it fast, scalable and cost-efficient”

– Matrix factorization, SGD and ALS (10 min)

– Parallelize and accelerate SGD and ALS (20 min)

– GPU accelerated SGD and ALS (20 min)

– Conclusion and QA (10 min)


Page 3:

Matrix Factorization

[Figure: the ratings matrix R (m users × n items, observed entries marked *) is approximated as the product of X (m × f user features) and Θᵀ (f × n item features); rating r_uv ≈ x_uᵀ θ_v.]

Page 4:

MF Explained using Recommender Systems

How: factorize the rating matrix R into R ≈ XΘᵀ and minimize the empirical loss:

min_{X,Θ} Σ_{(u,v) observed} (r_uv − x_uᵀ θ_v)² + λ (Σ_u ‖x_u‖² + Σ_v ‖θ_v‖²)

Input: users' ratings on some items

Output: user/item features

Use: predict missing ratings; use features for other tasks (e.g. clustering)

[Figure: the same shape of factorization appears elsewhere: a document × word matrix factorizes into a topic model; a word × word co-occurrence matrix factorizes into word embeddings.]
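The factorization and loss above can be written in a few lines of NumPy (a minimal dense sketch; the variable names and toy sizes are ours, not cuMF's):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, f = 6, 5, 2
X = rng.normal(size=(m, f))        # user features, row u is x_u
Theta = rng.normal(size=(n, f))    # item features, row v is theta_v
R = X @ Theta.T                    # R ≈ X Theta^T (here exact, for illustration)
mask = rng.random((m, n)) < 0.6    # which (u, v) ratings are observed
lam = 0.05

def empirical_loss(X, Theta):
    # sum over observed (u, v) of (r_uv - x_u . theta_v)^2, plus L2 regularization
    E = (R - X @ Theta.T) * mask
    return (E ** 2).sum() + lam * ((X ** 2).sum() + (Theta ** 2).sum())
```

With the planted factors the squared-error term is zero and only the regularizer remains, which is a handy sanity check.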

Page 5:

Matrix Factorization is Key in Machine Learning and HPC Applications

[Diagram: matrix factorization at the center, feeding:
– Recommender systems: predict missing ratings, group similar users/items
– Complex networks: link prediction, vertex clustering
– Web search: latent semantic models, matching queries and documents
– Natural language processing: word embeddings as input to DNNs
– Deep learning: embedding layers, model compression, tensor decomposition
Legend: part of this is supported in cuMF; the rest is to be supported.]


Page 6:

Challenge: MF needs to be fast, scalable, economic


Fast – Recommend/update timely

Scalable – Facebook: 100 B ratings, 1 B users

Economic – Avoid large infrastructure

Page 7:

To Solve MF: SGD

Stochastic gradient descent (SGD)
– Update takes one rating at a time
– Vector inner product: memory bound
– Needs many light epochs
– Parallelize: non-trivial
– Handles dense (implicit) ratings: no

[Figure: each sampled rating r_uv updates only its own x_u and θ_v.]
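The per-rating update can be sketched in NumPy (our own toy code, not cuMF's kernel): each observed rating touches only one row of X and one row of Θ, which is why the work per update is a length-f inner product plus two cheap vector updates.

```python
import numpy as np

def sgd_epoch(ratings, X, Theta, lr=0.01, lam=0.05):
    """One SGD epoch: each observed rating (u, v, r) updates only row X[u]
    and row Theta[v] -- a length-f inner product plus two AXPY-style updates."""
    for u, v, r in ratings:
        e = r - X[u] @ Theta[v]       # prediction error for this rating
        xu = X[u].copy()              # snapshot x_u before overwriting it
        X[u] += lr * (e * Theta[v] - lam * xu)
        Theta[v] += lr * (e * xu - lam * Theta[v])

# Toy run on ratings sampled from a planted low-rank matrix.
rng = np.random.default_rng(0)
m, n, f = 20, 15, 4
true = rng.normal(size=(m, f)) @ rng.normal(size=(f, n))
ratings = [(u, v, true[u, v]) for u in range(m) for v in range(n) if rng.random() < 0.5]
X = 0.1 * rng.normal(size=(m, f))
Theta = 0.1 * rng.normal(size=(n, f))
loss0 = sum((r - X[u] @ Theta[v]) ** 2 for u, v, r in ratings)
for _ in range(30):
    sgd_epoch(ratings, X, Theta)
loss1 = sum((r - X[u] @ Theta[v]) ** 2 for u, v, r in ratings)
```

Note the snapshot of x_u: the θ_v update must use the pre-update x_u, otherwise the two gradients interfere.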

Page 8:

To Solve MF: ALS

Alternating Least Squares (ALS)
– Update takes all ratings at a time
– Vector outer product & solve: compute bound
– Needs few heavy epochs
– Parallelize: straightforward
– Handles dense (implicit) ratings: yes

[Figure: solving for x_u gathers all the θ_v of items u rated.]
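One half-iteration of ALS (solving all x_u for a fixed Θ) can be sketched as follows; this is our own minimal NumPy version with illustrative names, using an exact dense solve of the normal equations. Solving the θ_v's works the same way with roles swapped.

```python
import numpy as np

def als_solve_users(ratings_by_user, Theta, lam=0.05):
    """For fixed Theta, each x_u solves the regularized least squares
    (Theta_u^T Theta_u + lam*I) x_u = Theta_u^T r_u,
    where Theta_u stacks the theta_v of the items user u rated."""
    f = Theta.shape[1]
    X = np.zeros((len(ratings_by_user), f))
    for u, pairs in enumerate(ratings_by_user):
        if not pairs:
            continue
        vs = [v for v, _ in pairs]
        rs = np.array([r for _, r in pairs])
        Tu = Theta[vs]                        # features of the items u rated
        A = Tu.T @ Tu + lam * np.eye(f)       # f x f Gramian (sum of outer products)
        X[u] = np.linalg.solve(A, Tu.T @ rs)  # exact LU/Cholesky-style solve
    return X

# Toy usage: exact ratings from planted factors, then one half-iteration.
rng = np.random.default_rng(1)
m, n, f = 12, 10, 3
Theta = rng.normal(size=(n, f))
true_X = rng.normal(size=(m, f))
ratings_by_user = [[(v, float(true_X[u] @ Theta[v]))
                    for v in range(n) if rng.random() < 0.7] for u in range(m)]
X = als_solve_users(ratings_by_user, Theta)
```

Since each user's system is independent, this loop is trivially parallel, which is the "Parallelize: straightforward" bullet above.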

Page 9:

To Solve MF: CD

Coordinate descent (CD)
– Similar to ALS
– But updates one coordinate of x_u and θ_v at a time

[Figure: a single feature dimension of x_u and θ_v is updated per step.]
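The one-coordinate closed-form update can be sketched as follows (our own minimal version; `cd_update_user_coord` is an illustrative name, not a cuMF API):

```python
import numpy as np

def cd_update_user_coord(u, k, ratings_by_user, X, Theta, lam=0.05):
    """Closed-form coordinate-descent update of the single entry X[u, k],
    holding all other coordinates of X and Theta fixed."""
    num = den = 0.0
    for v, r in ratings_by_user[u]:
        # residual of rating (u, v) with coordinate k's contribution removed
        resid = r - X[u] @ Theta[v] + X[u, k] * Theta[v, k]
        num += Theta[v, k] * resid
        den += Theta[v, k] ** 2
    X[u, k] = num / (lam + den)
```

Because this is the exact minimizer along that one coordinate, the regularized loss for user u can never increase after the update.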

Page 10:

To Parallelize ALS

Alternating Least Squares (ALS)
– Solve the x_u's independently (then the θ_v's)
– Parallelize the solves of the x_u's on multiple nodes; to make Θ available, either:
– Replicate Θ
– Partially replicate Θ
– Split Θ on multiple nodes

[Figure: each x_u solve reads the θ_v of the items u rated.]

Page 11:

Parallelize SGD: Hogwild!

[Figure: workers 1–3 update (x_u, θ_v) pairs concurrently; two workers touching the same θ_v cause an update conflict.]

Hogwild! [Niu et al. 2011] – parallel SGD converges despite (occasional) update conflicts
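A toy Hogwild!-style run (our own sketch; CPython threads serialize on the GIL, so this illustrates the lock-free update pattern rather than real parallel speedup):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hogwild_sgd(ratings, X, Theta, n_workers=3, epochs=30, lr=0.01, lam=0.05):
    """Lock-free parallel SGD: workers update the shared X, Theta with no
    synchronization, tolerating occasional conflicting updates."""
    shards = [ratings[i::n_workers] for i in range(n_workers)]
    def work(shard):
        for u, v, r in shard:
            e = r - X[u] @ Theta[v]
            xu = X[u].copy()
            X[u] += lr * (e * Theta[v] - lam * xu)
            Theta[v] += lr * (e * xu - lam * Theta[v])
    for _ in range(epochs):
        with ThreadPoolExecutor(n_workers) as pool:
            list(pool.map(work, shards))

# Toy data: ratings from a planted low-rank matrix.
rng = np.random.default_rng(3)
m, n, f = 20, 15, 4
true = rng.normal(size=(m, f)) @ rng.normal(size=(f, n))
ratings = [(u, v, true[u, v]) for u in range(m) for v in range(n) if rng.random() < 0.5]
X = 0.1 * rng.normal(size=(m, f))
Theta = 0.1 * rng.normal(size=(n, f))
loss0 = sum((r - X[u] @ Theta[v]) ** 2 for u, v, r in ratings)
hogwild_sgd(ratings, X, Theta)
loss1 = sum((r - X[u] @ Theta[v]) ** 2 for u, v, r in ratings)
```

The point of Hogwild! is exactly that `work` takes no locks: the occasional racy read-modify-write on a shared row is tolerated and training still converges.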

Page 12:

Hogwild! Not Good Enough?

Random sampling hurts cache performance (on GPUs) – hardware cannot prefetch

[Figure: Hogwild workers sample random (u, v) cells of R, so their reads of x_u and θ_v have no locality.]

Page 13:

Parallelize SGD: Matrix Blocking

Divide R into blocks, say 4×4

4 workers update 4 "non-overlapping" blocks concurrently (blocks sharing no rows or columns)

– Workers do not need to communicate

[Figure: after dividing R into blocks (step 0), the blocks are processed wave by wave (waves 1–3 shown); each wave is a diagonal set of 4 conflict-free blocks.]
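One natural wave schedule for such a blocking can be computed directly (a sketch of the idea, not cuMF's scheduler):

```python
def blocking_waves(p):
    """Wave schedule for a p x p blocking of R: wave w holds the p blocks
    (i, (i + w) % p).  Within a wave no two blocks share a block-row or
    block-column, so their users/items are disjoint and p workers can
    update them concurrently without communicating."""
    return [[(i, (i + w) % p) for i in range(p)] for w in range(p)]

waves = blocking_waves(4)   # 4 waves of 4 conflict-free blocks each
```

Running the p waves in sequence covers every block of R exactly once per epoch.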

Page 14:

Parallelize SGD: Matrix Blocking

Cons: all 4 workers need to complete a wave before the next wave starts

Solution: more blocks than workers
– e.g. 6×6 blocks, 4 workers
– Worker 0 can immediately pick up another block when its current block is done

Cons: scheduling overhead

Page 15:

MF methods with SGD, ALS and CCD


Page 16:

Challenge: compute and memory capacity of CPU

A CPU offers ~1 Tflops compute and ~80 GB/s memory bandwidth. With f = 100, per epoch:

• ALS floating-point operations
– Netflix: 1.5 T
– Hugewiki: 80 T
– Facebook: 2000 T

• SGD memory transfer
– Netflix: 80 GB
– Hugewiki: 2.4 TB
– Facebook: 80 TB

Both far exceed one CPU's flops and bandwidth capacity.
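The SGD memory-transfer numbers can be reproduced with a back-of-envelope calculation (the nonzero counts below are rough public figures we assume: Netflix ~100M ratings, Hugewiki ~3B, Facebook ~100B):

```python
f = 100
# Per rating, SGD reads and writes x_u and theta_v in FP32 (the 2 * f floats
# are the dominant term) plus the rating itself:
bytes_per_rating = (2 * f + 1) * 4

traffic = {name: nnz * bytes_per_rating
           for name, nnz in [("Netflix", 100e6), ("Hugewiki", 3e9), ("Facebook", 100e9)]}
# → roughly 80 GB, 2.4 TB and 80 TB per epoch, matching the slide.
# ALS instead spends ~nnz * f^2 flops forming the Gramians plus an O(f^3)
# solve per user/item, which is why it is compute rather than memory bound.
```

At ~80 GB/s of CPU bandwidth, even the Netflix epoch is about a second of pure memory traffic, and Facebook-scale data is hopeless on one socket.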

Page 17:

GPU vs. CPU: compute FLOPS and memory bandwidth

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

• Raw performance: 1 GPU ≈ 10× CPU
• Practical performance, due to slow interconnection between nodes: 1 GPU > 10× CPU; 4 GPUs >> 40× CPUs

Page 18:

Goal: a CUDA library for MF

[Stack diagram: applications such as collaborative filtering and word embedding sit on cuMF, whose ALS and SGD kernels run on CUDA GPUs.]

Fast
– Fast training; update the model quickly

Scalable
– Deal with big data; exploit fast interconnection

Cost efficient
– Fully utilize flops or BW; cheaper than CPU solutions

Page 19:

Challenges of ALS

• ALS needs to solve many f×f systems, one per user (and per item):

(Σ_{v∈R_u} θ_v θ_vᵀ + λI) x_u = Σ_{v∈R_u} r_uv θ_v

• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive

• Challenge 2: LU or Cholesky solver: compute intensive

• Challenge 3: a single GPU can NOT handle big m, n and N_z

Page 20:

Challenge 1: improve flops

Nvidia Pascal: Memory BW: 740 GB/sec, compute: 11 Tflops

Higher flops ⇒ need higher operational intensity (more flops per byte) ⇒ caching!

[Roofline figure: attainable flops vs. operational intensity (flops/byte). The 740 GB/s bandwidth roof meets the 11 Tflops compute roof at ≈15 flops/byte; kernels left of that point leave the GPU under-utilized (memory bound), kernels at or beyond it are fully-utilized (compute bound).]

S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009)
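The ridge point in the roofline above follows directly from the two peaks (a small sanity check, not cuMF code):

```python
peak_flops = 11e12   # Pascal compute peak, 11 Tflops
peak_bw = 740e9      # Pascal memory bandwidth, 740 GB/s

# Machine balance: the operational intensity where the bandwidth roof
# meets the compute roof (about 15 flops/byte for these numbers).
balance = peak_flops / peak_bw

def attainable_flops(oi):
    """Roofline model: attainable flops for a kernel whose operational
    intensity is `oi` flops per byte moved."""
    return min(peak_flops, peak_bw * oi)
```

A kernel below ~15 flops/byte is bandwidth bound no matter how well its arithmetic is tuned, which is why the ALS kernels work so hard to raise intensity via registers and shared memory.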

Page 21:

Address challenge 1: memory-optimized ALS

• To obtain the Gramian A_u = Σ_{v∈R_u} θ_v θ_vᵀ, stage data down the memory hierarchy:

1. non-coalesced read of the rated θ_v's from global memory
2. stage them into shared memory (smem)
3. tile and aggregate the outer products in registers

Page 22:

Address Challenge 2: exact solver is compute intensive

[Diagram: baseline -- ALS iterations 1..n each exactly solve X then Θ, an O(f³) LU/Cholesky solve per system. Alternative -- each iteration approximately solves X then Θ at O(f²) cost, e.g. working with f_s << f.]

Page 23:

Address Challenge 2: use CG solver

Solver time: CG ≈ ¼ that of LU

The CG solver is memory- (instead of compute-) bound

CG w/ FP16 takes ≈ ½ the time of CG w/ FP32
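A plain conjugate-gradient solver of the kind that can stand in for LU/Cholesky in ALS (our own minimal NumPy version; cuMF's is a batched GPU kernel):

```python
import numpy as np

def cg_solve(A, b, iters=20, tol=1e-10):
    """Conjugate gradient for the SPD system A x = b.  Each iteration is
    dominated by one matrix-vector product, so the solver is memory bound,
    and a few iterations already give a good approximate ALS solve."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:     # converged: residual (squared norm) small enough
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Since each iteration costs O(f²) (one mat-vec on the f×f Gramian), truncating after a few iterations gives the O(f²) approximate solve from the previous page, versus O(f³) for an exact factorization.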

Page 24:

Address Challenge 3: scale-up ALS on multiple GPUs

Model parallel: each GPU solves a portion of the model

[Figure: the model (X, Θ) is partitioned across GPUs.]

Page 25:

Address Challenge 3: scale-up ALS on multiple GPUs

Data parallel: each GPU solves with a portion of the training data, combined with model parallelism

[Figure: ratings are partitioned across GPUs (data parallel) while the model is also partitioned (model parallel).]

Page 26:

Recap: challenges of ALS

• ALS needs to solve many f×f systems, one per user (and per item)

• Challenge 1: access and aggregate many θ_v's: memory irregular and compute intensive
– addressed with registers, smem and staged non-coalesced reads

• Challenge 2: LU or Cholesky solver: compute intensive
– addressed with an approximate CG solver and FP16

• Challenge 3: a single GPU can NOT handle big m, n and N_z
– addressed with model and data parallelism, and topology-aware reduction

Page 27:

Connect cuMF to Spark MLlib

Spark applications relying on mllib/ALS need no change

A modified mllib/ALS detects GPUs and offloads matrix computation

Leverage the best of Spark (scale-out) and GPU (scale-up)

[Stack: ALS apps → mllib/ALS → cuMF JNI]

https://github.com/IBMSparkGPU/CUDA-MLlib
http://www-01.ibm.com/support/docview.wss?uid=swg21983421

Page 28:

Connect cuMF to Spark MLlib

[Diagram: each Power 8 node hosts 2 K40 GPUs; rating RDDs on the CPU feed CUDA kernels on GPU1 and GPU2, and parameters move between nodes via Spark shuffle.]

RDDs on CPU: distribute rating data and shuffle parameters

Solver on GPU: form and solve the least-squares systems

Able to run on multiple nodes, and multiple GPUs per node

Page 29:

Challenges of SGD

• SGD iterates over all ratings and applies updates in sequence

• Memory bound

Two design questions:

1. the update kernel: memory coalescing, half precision, warp shuffle, ILP, register reuse, cache
2. how to parallelize: (a) Hogwild, or (b) matrix blocking across parallel workers

Page 30:

Experiment 1: is cuMF fast and scalable?

• cuMF_ALS w/ FP16 on Maxwell and Pascal; 1 GPU for Netflix and Yahoo

• Baselines: LIBMF (1 CPU w/ 40 threads); NOMAD (32 nodes for Netflix and Yahoo, 64 HPC nodes for Hugewiki)

• cuMF is 2-10x as fast

Page 31:

Experiment 2: cuMF_ALS and cuMF_SGD (on Maxwell)

• ALS is slightly slower than SGD on a single GPU

• On the big dataset Hugewiki, ALS@4 GPUs performs best -- SGD is harder to parallelize across multiple GPUs!

Page 32:

Experiment 3: is cuMF cost efficient?

• cuMF_ALS @4 Maxwell runs in ≈ 1/10 the time of SparkALS @50 nodes

• at ≈ $2.5/hr, ≈ 1/10 the hourly cost of 50 nodes

• ⇒ ≈ 1% of SparkALS's cost

Page 33:

Conclusion

Why accelerate matrix factorization using GPU?

– MF needs to be fast, scalable, and economic

– GPU offers ~10x flops, memory BW, and fast interconnect

How does cuMF tackle the challenges?

– Optimize memory access, parallelism and communication

– Approximate computing

– Reduced precision

What is the result?

– Implemented ALS and SGD

– Up to 10x as fast, 100x as cost-efficient

– Use cuMF standalone, with Spark or Tensorflow


Page 34:

Thank you, questions?

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016

CuMF_SGD: Fast and Scalable Matrix Factorization. CoRR abs/1610.05838, 2016

Code: http://github.com/cuMF/

Blog: http://ibm.biz/cumf-blog

Contact: Wei Tan, [email protected]

