A framework for practical fast matrix multiplication (BLIS retreat)

Slides from my talk at the BLIS retreat in Austin, TX.

Page 1: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: Parallel performance of Strassen on <N,N,N> (Effective GFLOPS / core, 0-25, vs. Dimension N, 0-15000). Series: MKL, DFS, BFS, and HYBRID, each at 6 and 24 cores.]

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

BLIS Retreat, September 26, 2014

arXiv: 1409.2908

1

Page 2: A framework for practical fast matrix multiplication (BLIS retreat)

Fast matrix multiplication: bridging theory and practice

• There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently [Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than MKL (sequentially, and sometimes in parallel)

• We use code generation to do extensive prototyping. There are several practical issues, and there is plenty of room for improvement (lots of expertise at UT to help here!)

2

[Figure: exponent of matrix multiplication over time, from 3 (classical) to 2.81 [Strassen69] to 2.37 [Williams12]; the trivial lower bound is 2.]

Page 3: A framework for practical fast matrix multiplication (BLIS retreat)

Strassen’s algorithm

3

Page 4: A framework for practical fast matrix multiplication (BLIS retreat)

Key ingredients of Strassen’s algorithm

1. Block partitioning of matrices (<2, 2, 2>)

2. Seven linear combinations of sub-blocks of A

3. Seven linear combinations of sub-blocks of B

4. Seven matrix multiplies to form the Mr (recursive)

5. Linear combinations of the Mr to form the Cij

4
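Steps 1-5 above can be made concrete with a minimal one-level Strassen sketch in Python/NumPy (an illustration, not the authors' generated code; assumes square matrices of even dimension, with np.dot as the base case):

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursive step of Strassen's <2,2,2> algorithm."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # Steps 2-4: seven multiplies of linear combinations of sub-blocks
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    # Step 5: linear combinations of the Mr form the output blocks Cij
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

In a full implementation the seven `@` products would call `strassen_one_level` recursively until a cutoff, then fall back to dgemm.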

Page 5: A framework for practical fast matrix multiplication (BLIS retreat)

Key ingredients of fast matmul algorithms

1. Block partitioning of matrices (<M, K, N>)

2. R linear combinations of sub-blocks of A

3. R linear combinations of sub-blocks of B

4. R matrix multiplies to form the Mr (recursive)
   (If R < MKN, the algorithm is asymptotically faster than classical matrix multiplication.)

5. Linear combinations of the Mr to form the Cij

5
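The condition R < MKN comes from the recursion. A standard back-of-the-envelope derivation (not spelled out on the slide): one recursive step on an n x n problem spawns R subproblems while shrinking the dimension by a factor of (MKN)^{1/3}, so

```latex
T(n) = R\,T\!\bigl(n/(MKN)^{1/3}\bigr) + O(n^2)
\;\Longrightarrow\;
T(n) = O\bigl(n^{\omega_0}\bigr),
\qquad
\omega_0 = 3\log_{MKN} R .
```

Hence R < MKN gives omega_0 < 3. For Strassen (<2, 2, 2>, R = 7), omega_0 = log_2 7, which is about 2.81.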

Page 6: A framework for practical fast matrix multiplication (BLIS retreat)

“Outer product” fast algorithm

• <4, 2, 4> partitioning

• R = 26 multiplies (< 4 * 2 * 4 = 32): 23% speedup per recursive step (if everything else is free)

• Linear combinations of the Aij to form the Sr: 68 terms

• Linear combinations of the Bij to form the Tr: 52 terms

• Linear combinations of the Mr to form the Cij: 69 terms

6
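The 23% figure and the asymptotic gain can be checked in a couple of lines (a quick sanity check, not from the slides):

```python
import math

M, K, N, R = 4, 2, 4, 26   # the <4,2,4> "outer product" algorithm

# Per-recursive-step speedup over classical, if additions were free:
speedup = (M * K * N) / R   # 32/26, about 1.23, i.e., ~23%

# Asymptotic exponent omega_0 = 3 * log_{MKN}(R):
omega0 = 3 * math.log(R) / math.log(M * K * N)   # about 2.82
```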

Page 7: A framework for practical fast matrix multiplication (BLIS retreat)

Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms

• The tensors are small, but we need exact decompositions, and computing tensor rank is NP-hard

• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]

• We have around 10 fast algorithms for <M, K, N> decompositions. We also have permutations, e.g., <K, M, N>.

7

Page 8: A framework for practical fast matrix multiplication (BLIS retreat)

8

[Figure: examples of discovered fast algorithms, from [Smirnov13] and [Strassen69].]

Page 9: A framework for practical fast matrix multiplication (BLIS retreat)

Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of the Mr to form the Cij

• We use code generation to rapidly prototype fast algorithms

• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns

9
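One way to picture the compact representation: a rank-R algorithm is just three coefficient matrices (U, V, W) giving the linear combinations in items 2 and 3. A tiny interpreter for this representation (an illustrative sketch, not the authors' generator; the coefficients shown encode Strassen's <2,2,2> algorithm, R = 7):

```python
import numpy as np

# Rows index sub-blocks of A (A11, A12, A21, A22), B (B11, B12, B21, B22),
# and C (C11, C12, C21, C22) in row-major order; columns index r = 1..R.
U = np.array([[1, 0, 1, 0, 1, -1, 0],
              [0, 0, 0, 0, 1,  0, 1],
              [0, 1, 0, 0, 0,  1, 0],
              [1, 1, 0, 1, 0,  0, -1]])
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 1,  0, 0, 1, 0],
              [0, 0, 0,  1, 0, 0, 1],
              [1, 0, -1, 0, 1, 0, 1]])
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]])

def fast_mm_one_level(A, B, U, V, W, dims=(2, 2, 2)):
    """Evaluate one step of an <M,K,N> fast algorithm from its (U, V, W)."""
    M, K, N = dims
    a = [blk for row in np.vsplit(A, M) for blk in np.hsplit(row, K)]
    b = [blk for row in np.vsplit(B, K) for blk in np.hsplit(row, N)]
    R = U.shape[1]
    # S_r = sum_i U[i,r] * A_i;  T_r = sum_j V[j,r] * B_j;  M_r = S_r * T_r
    Ms = [sum(U[i, r] * a[i] for i in range(len(a))) @
          sum(V[j, r] * b[j] for j in range(len(b))) for r in range(R)]
    # C_k = sum_r W[k,r] * M_r
    c = [sum(W[k, r] * Ms[r] for r in range(R)) for k in range(M * N)]
    return np.block([[c[i * N + j] for j in range(N)] for i in range(M)])
```

A code generator emits specialized (loop-free) code for a given (U, V, W) instead of interpreting it, but the input data is the same.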

Page 10: A framework for practical fast matrix multiplication (BLIS retreat)

Practical issues

• Best way to do matrix additions? (in paper)

• Can we eliminate redundant linear combinations? (in paper)

• Problem shapes other than square (this talk)

• When to stop recursion? (this talk)

• How to parallelize? (this talk)

10

Page 11: A framework for practical fast matrix multiplication (BLIS retreat)

Recursion cutoff: look at the gemm curve

Basic idea: take another recursive step if the subproblems will still operate at high performance.

[Figures: Sequential dgemm performance (GFLOPS, 10-25, vs. Dimension N, 0-3000) for N x 800 x 800, N x 800 x N, and N x N x N, with peak marked; Parallel dgemm performance on 24 cores (GFLOPS / core, 10-25, vs. Dimension N, 0-8000).]

<M, K, N> = <4, 2, 3>

11
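The cutoff rule can be sketched as a simple recursion guard (a hypothetical helper with a made-up `cutoff` value; in practice the threshold is read off measured dgemm curves like those above):

```python
def should_recurse(m, k, n, dims=(4, 2, 3), cutoff=1500):
    """Take another recursive step only if every sub-block dimension stays
    at or above `cutoff`, a (hypothetical) size below which dgemm
    performance drops off."""
    M, K, N = dims
    return min(m // M, k // K, n // N) >= cutoff

# e.g., for <4,2,3> on a 12000 x 6000 x 9000 multiply: sub-blocks are
# 3000 x 3000 x 3000, so recurse; one more step would give
# 750 x 1500 x 1000 sub-blocks, so stop and call dgemm.
```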

Page 12: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

[Figure: Sequential performance on N x N x N (Effective GFLOPS, 16-28, vs. Dimension N, 0-8000): MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; true peak marked.]

Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds)

12
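The effective-GFLOPS metric credits every algorithm with the classical flop count, so fast algorithms can exceed the hardware peak. It is a direct transcription of the slide's formula:

```python
def effective_gflops(M, K, N, seconds):
    """Effective GFLOPS for an M x K x N multiply: classical flop count
    (2*M*K*N) per second, regardless of which algorithm was used."""
    return 1e-9 * 2 * M * K * N / seconds
```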

Page 13: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

[Figure: Sequential performance on N x N x N (Effective GFLOPS, 16-28, vs. Dimension N, 0-8000): MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]

• All algorithms beat MKL on large problems

• Strassen’s algorithm is hard to beat

13

Page 14: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

[Figure: Sequential performance on N x 1600 x N (Effective GFLOPS, 22-27, vs. dimension N, 2000-12000): MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

• Almost all algorithms beat MKL

• <4, 2, 4> and <3, 2, 3> tend to perform the best

14

Page 15: A framework for practical fast matrix multiplication (BLIS retreat)

Sequential performance

• Almost all algorithms beat MKL

• <4, 3, 3> and <4, 2, 3> tend to perform the best

[Figure: Sequential performance on N x 2400 x 2400 (Effective GFLOPS, 22-26, vs. dimension N, 10000-18000): MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

15

Page 16: A framework for practical fast matrix multiplication (BLIS retreat)

Parallelization

[Diagram: recursion tree. C is formed from linear combinations of M1, M2, …, M7, and each Mr is computed recursively from seven subproblems of its own.]

16

Page 17: A framework for practical fast matrix multiplication (BLIS retreat)

DFS Parallelization

[Diagram: recursion tree traversed depth-first; at each node, all threads cooperate via parallel MKL.]

+ Easy to implement

+ Load balanced

+ Same memory footprint as sequential

- Need large base cases for high performance

17

Page 18: A framework for practical fast matrix multiplication (BLIS retreat)

BFS Parallelization

[Diagram: the subproblems M1, …, M7 run as independent tasks, one thread each, with omp taskwait between recursion levels.]

+ High performance for smaller base cases

- Sometimes harder to load balance: 24 threads, 49 subproblems

- More memory

18
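The slides use OpenMP tasks; the same BFS idea can be mimicked in Python (illustrative only, and a thread pool stands in for `omp task` here): the seven independent products run concurrently, and the pool's join point plays the role of `omp taskwait`.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def strassen_bfs(A, B, workers=7):
    """One level of Strassen with BFS-style task parallelism."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # The seven subproblems, as (left operand, right operand) tasks:
    tasks = [(A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
             (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
             (A12 - A22, B21 + B22)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # BFS: each product is an independent task; leaving the `with`
        # block joins all tasks, like omp taskwait.
        M1, M2, M3, M4, M5, M6, M7 = pool.map(lambda t: t[0] @ t[1], tasks)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

The load-balance issue on the slide shows up here too: 7 tasks do not divide evenly among, say, 24 workers, and two recursion levels give 49 tasks.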

Page 19: A framework for practical fast matrix multiplication (BLIS retreat)

HYBRID parallelization

[Diagram: most subproblems run BFS-style, one thread each; the leftover subproblem is executed by all threads, with omp taskwait between levels.]

+ Better load balancing

- Explicit synchronization, or else we can over-subscribe threads

19

Page 20: A framework for practical fast matrix multiplication (BLIS retreat)

20

[Figures: Parallel performance of <4,2,4> on <N,2800,N> (Effective GFLOPS / core, 0-25, vs. Dimension N). Series: MKL, DFS, BFS, and HYBRID, each at 6 and 24 cores.]

Page 21: A framework for practical fast matrix multiplication (BLIS retreat)

Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions

• Parallel dgemm on 24 cores: easily get 50-75% of peak

• STREAM benchmark: < 6x speedup in read/write performance on 24 cores

[Diagram: recursion tree node C with its subproblems M1, …, M7.]

21

Page 22: A framework for practical fast matrix multiplication (BLIS retreat)

Parallel performance

• 6 cores: similar performance to sequential

• 24 cores: can sometimes beat MKL, but barely

[Figures: Performance on N x N x N for N = 9000-13000. Left: 6 cores (Effective GFLOPS / core, 18-28). Right: 24 cores (16-22). Series: MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.]

22

Page 23: A framework for practical fast matrix multiplication (BLIS retreat)

Parallel performance

• 6 cores: similar performance to sequential

• 24 cores: MKL best for large problems

[Figures: Performance on N x 2800 x N. Left: 6 cores (Effective GFLOPS / core, 18-24, N = 10000-20000). Right: 24 cores (12-20, N = 5000-20000), with a region of bad MKL performance annotated. Series: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

23

Page 24: A framework for practical fast matrix multiplication (BLIS retreat)

Parallel performance

• 6 cores: similar performance to sequential

• 24 cores: MKL usually the best

[Figures: Performance on N x 3000 x 3000. Left: 6 cores (Effective GFLOPS / core, 18-23, N = 10000-20000). Right: 24 cores (12-18, N = 16000-26000). Series: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

24

Page 25: A framework for practical fast matrix multiplication (BLIS retreat)

High-level conclusions

• For square matrix multiplication, Strassen’s algorithm is hard to beat

• For rectangular matrix multiplication, use a fast algorithm that “matches the shape”

• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; it should be less of an issue in distributed memory

Future work:

• Numerical stability

• Using fast matmul as a kernel for other algorithms in numerical linear algebra

25

Page 26: A framework for practical fast matrix multiplication (BLIS retreat)

[Figure: Parallel performance of Strassen on <N,N,N> (Effective GFLOPS / core, 0-25, vs. Dimension N, 0-15000). Series: MKL, DFS, BFS, and HYBRID, each at 6 and 24 cores.]

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

BLIS Retreat, September 26, 2014

arXiv: 1409.2908

26

Page 27: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: sub-blocks A11, A12, A21, A22 combined into S1, S2, …, S7.]

“Pairwise”: each Sr built up with repeated DAXPY calls (2x DAXPY per combination)

27

Page 28: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: sub-blocks A11, A12, A21, A22 combined into S1, S2, …, S7.]

“Write once”: a custom “DAXPY” forms each Sr in a single pass, writing the output only once

28

Page 29: A framework for practical fast matrix multiplication (BLIS retreat)

Matrix additions (linear combinations)

[Diagram: sub-blocks A11, A12, A21, A22 combined into S1, S2, …, S7.]

“Streaming”: entry-wise updates

29
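A rough sketch of the “pairwise” versus “write once” strategies for forming one combination, say S6 = A21 - A11 (illustrative NumPy only; the real implementations run DAXPY-style kernels over large blocks):

```python
import numpy as np

def pairwise(A21, A11):
    """'Pairwise': accumulate via repeated daxpy-like updates.
    Each update both reads and writes S, costing extra memory traffic."""
    S = np.zeros_like(A21)
    S += A21          # first daxpy:  S = S + 1.0 * A21
    S += -1.0 * A11   # second daxpy: S = S - 1.0 * A11
    return S

def write_once(A21, A11):
    """'Write once': read every operand, write the output exactly once."""
    return A21 - A11  # one fused pass over the data
```

Both produce the same S6; they differ only in how many times the output is read and written, which matters once bandwidth (not flops) is the bottleneck.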

Page 30: A framework for practical fast matrix multiplication (BLIS retreat)

Common subexpression elimination (CSE)

• Example in the <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: B12, B22, B23, B24 combined directly into T11 and T25.]

Four additions, six reads, two writes

30

Page 31: A framework for practical fast matrix multiplication (BLIS retreat)

Common subexpression elimination (CSE)

• Example in the <4, 2, 4> algorithm (R = 26 multiplies):

[Diagram: the shared subexpression Y is computed once and reused to form T11 and T25.]

Three additions, six reads, three writes: a net increase in communication!

31

Page 32: A framework for practical fast matrix multiplication (BLIS retreat)

CSE does not really help

Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds)

32