
Optimizing Matrix Multiplication with a Classifier Learning System

Xiaoming Li (presenter), María Jesús Garzarán

University of Illinois at Urbana-Champaign

Tuning library for recursive matrix multiplication

• Use cache-aware algorithms that take into account architectural features
  – Memory hierarchy
  – Register file, …

• Take into account input characteristics
  – Matrix sizes

• The process of tuning is automatic.

Recursive Matrix Partitioning

• Previous approaches:
  – Multiple recursive steps
  – Only divide by half

[Figure: matrices A and B are each split in half, one level per recursive step (Steps 1 and 2)]

Recursive Matrix Partitioning

• Our approach is more general:
  – No need to divide by half
  – May use a single step to reach the same partition (see the sketch below)
  – Faster and more general

[Figure: A and B reach the same partition in a single step (Step 1)]
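A minimal sketch of the idea, assuming square n × n matrices and a factor p that divides n (the authors' implementation is more general): one recursive step with partition factor p covers what divide-by-half (p = 2) needs several steps for.

    /* One recursive step with a general partition factor p;
     * divide-by-half is the special case p = 2.  Matrices are
     * n x n, row-major with leading dimension ld; the caller
     * zeroes C, and p is assumed to divide n evenly. */
    static void mmm_rec(int n, int ld, int p,
                        const double *A, const double *B, double *C)
    {
        if (n <= 64) {                       /* base case: simple kernel */
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        int b = n / p;                       /* sub-block dimension */
        for (int i = 0; i < p; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < p; k++)
                    mmm_rec(b, ld, p,
                            A + (i*b)*ld + (k*b),   /* block (i,k) of A */
                            B + (k*b)*ld + (j*b),   /* block (k,j) of B */
                            C + (i*b)*ld + (j*b));  /* block (i,j) of C */
    }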

Our approach

• A general framework that describes a family of recursive matrix multiplication algorithms where, given the input dimensions of the matrices, we determine:
  – The number of partition levels
  – How to partition at each level

• An intelligent search method based on a classifier learning system
  – Searches for the best partitioning strategy in a huge search space

Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results

Recursive layout framework

• Multiple levels of recursion
  – Takes into account the cache hierarchy

[Figure: an 8 × 8 matrix in standard row-major layout, elements numbered 1–64 row by row]

Recursive layout framework

• Multiple levels of recursion
  – Takes into account the cache hierarchy

[Figure: the same 8 × 8 matrix after two levels of recursive blocking: elements are contiguous within each 2 × 2 sub-block, sub-blocks within each 4 × 4 quadrant, and quadrants in row-major order, so the top-left sub-block holds 1 2 / 3 4 and the top-left quadrant holds 1–16]
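To make the numbering concrete, here is a small sketch (not the authors' code) of the element offset under exactly this two-level layout, dividing by 2 at each level:

    /* Offset of element (i, j) in an n x n matrix stored with two
     * levels of recursive blocking, dividing by 2 at each level.
     * Quadrants, sub-blocks, and elements are each row-major,
     * matching the 8 x 8 figure: rec_offset(8, 0, 2) == 4, the
     * element numbered 5. */
    static int rec_offset(int n, int i, int j)
    {
        int q = n / 2, b = n / 4;                /* quadrant, sub-block dims */
        int qi = i / q, qj = j / q;              /* which quadrant */
        int bi = (i % q) / b, bj = (j % q) / b;  /* which sub-block */
        return (qi * 2 + qj) * q * q             /* skip earlier quadrants */
             + (bi * 2 + bj) * b * b             /* skip earlier sub-blocks */
             + (i % b) * b + (j % b);            /* row-major inside block */
    }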

Padding

• Necessary when the partition factor is not a divisor of the matrix dimension.

• Example, partitioning one dimension across two levels (see the sketch below):
  – 2000 divided by 3: pad to 2001, giving tiles of 667
  – 667 divided by 4 at the next level: pad each tile to 668 (2004 overall), giving sub-tiles of 167
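The rounding rule behind these numbers, as a minimal sketch:

    /* Pad a dimension up to the next multiple of the partition
     * factor p.  pad_dim(2000, 3) == 2001 (tiles of 667) and
     * pad_dim(667, 4) == 668 (sub-tiles of 167). */
    static int pad_dim(int dim, int p)
    {
        return (dim + p - 1) / p * p;
    }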

Recursive layout in our framework

• Multiple levels of recursion
  – Supports the cache hierarchy

• Square tiles → rectangular tiles
  – Fit non-square matrices

[Figure: a 9 × 8 matrix is padded to 10 × 8 and partitioned into rectangular tiles]

Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results

Two methods to partition matrices

• Partition by Block (PB)
  – Specify the size of each tile
  – Example:
    • Dimensions (M, N, K) = (100, 100, 40)
    • Tile size (bm, bn, bk) = (50, 50, 20)
    • Partition factors (pm, pn, pk) = (2, 2, 2)
  – Tiles need not be square

  (pm, pn, pk) = (⌈M/bm⌉, ⌈N/bn⌉, ⌈K/bk⌉)
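A minimal sketch of the PB primitive (the function name is illustrative):

    /* Partition by Block: tile sizes in, partition factors out.
     * partition_by_block(100, 100, 40, 50, 50, 20, ...) gives
     * (pm, pn, pk) = (2, 2, 2). */
    static void partition_by_block(int M, int N, int K,
                                   int bm, int bn, int bk,
                                   int *pm, int *pn, int *pk)
    {
        *pm = (M + bm - 1) / bm;    /* ceil(M / bm) */
        *pn = (N + bn - 1) / bn;    /* ceil(N / bn) */
        *pk = (K + bk - 1) / bk;    /* ceil(K / bk) */
    }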

Two methods to partition matrices

• Partition by Size (PS)
  – Specify the maximum size of the three tiles
  – Keep the ratios between the dimensions constant
  – Example:
    • (M, N, K) = (100, 100, 50)
    • Maximum tile size for M, N = 1250
    • (pm, pn, pk) = (2, 2, 1)
  – Generalization of the “divide-by-half” approach (one plausible reading is sketched below)
    • Tile size = 1/4 × matrix size
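A hedged sketch of one plausible reading of the PS primitive; the slides do not spell out the exact rule, so this shows only the general idea:

    /* Assumed PS rule: grow a common partition factor until the
     * M x N tile is no larger than the given maximum size.  Using
     * one factor for both dimensions keeps the ratio between the
     * tile's dimensions equal to M:N; with max_tile = 1/4 of the
     * matrix size this reduces to divide-by-half. */
    static int partition_by_size(int M, int N, long max_tile)
    {
        int p = 1;
        while ((long)((M + p - 1) / p) * ((N + p - 1) / p) > max_tile)
            p++;
        return p;    /* pm = pn = p */
    }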

Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results

Classifier Learning System

• Use the two partition primitives to determine how the input matrices are partitioned
  – Determine the partition factors at each level

  f: (M, N, K) → (pm_i, pn_i, pk_i), i = 0, 1, 2 (only 3 levels are considered)

• The partition factors depend on the matrix size
  – E.g., the partition factors of a 1000 × 1000 matrix should be different from those of a 50 × 1000 matrix.

• The partition factors also depend on architectural characteristics, such as the cache size.

Determine the best partition factors

• The search space is huge, so exhaustive search is impossible

• Our proposal: use a multi-step classifier learning system
  – Creates a table that, given the matrix dimensions, determines the partition factors

Classifier Learning System

• The result of the classifier learning system is a table with two columns

• Column 1 (Pattern): a string of ‘0’, ‘1’, and ‘*’ that encodes the dimensions of the matrices (matching is sketched below)

• Column 2 (Action): the partition method for one step
  – Built from the “partition-by-block” and “partition-by-size” primitives with different parameters
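A minimal sketch (illustrative, not the authors' code) of matching one dimension against such a pattern: the dimension is written in binary, most significant bit first, and ‘*’ matches either bit.

    #include <string.h>

    /* Does a dimension match a ternary pattern of '0'/'1'/'*'?
     * The pattern length fixes the bit width (5 bits per dimension
     * in the table on the next slide). */
    static int matches(unsigned dim, const char *pattern)
    {
        size_t bits = strlen(pattern);
        for (size_t i = 0; i < bits; i++) {
            unsigned bit = (dim >> (bits - 1 - i)) & 1;
            if (pattern[i] != '*' && (unsigned)(pattern[i] - '0') != bit)
                return 0;
        }
        return 1;
    }

    /* matches(16, "10***") and matches(24, "11***") both hold
     * (16 = 10000, 24 = 11000), so the dimension pair (16, 24)
     * selects the table row (10***, 11***). */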

Learn with Classifier System

Pattern               Action
(10***, 11***)        PS 100
…                     …
(010**, 011**)        PB (4, 4)

• 5 bits per dimension

• Worked example: dimensions (16, 24) = (10000, 11000) match (10***, 11***), so PS 100 is applied and the tiles become (8, 12) (size 96 ≤ 100). The new dimensions (8, 12) = (01000, 01100) match (010**, 011**), so PB (4, 4) partitions them into (4, 4) tiles. A sketch of this table-driven lookup follows.
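A hedged sketch of the table-driven lookup; the struct and names are illustrative, and it reuses matches() from the earlier sketch:

    /* One table row: a ternary pattern per dimension plus an action. */
    struct rule {
        const char *pat_m, *pat_n;   /* 5-bit patterns */
        const char *action;          /* e.g. "PS 100" or "PB (4,4)" */
    };

    static const struct rule table[] = {
        { "10***", "11***", "PS 100"   },
        { "010**", "011**", "PB (4,4)" },
    };

    /* Return the action of the first rule that matches both
     * dimensions, or NULL if no rule matches. */
    static const char *lookup(unsigned m, unsigned n)
    {
        for (size_t r = 0; r < sizeof table / sizeof *table; r++)
            if (matches(m, table[r].pat_m) && matches(n, table[r].pat_n))
                return table[r].action;
        return NULL;
    }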

How does the classifier learning algorithm work?

• Change the table based on performance and accuracy feedback from previous runs.

• Mutate the condition part of the table to adjust the range of matching matrix dimensions.

• Mutate the action part to find the best partition method for the matching matrices.
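A loose sketch of this loop; every helper below is an assumed placeholder, since the slides name the operations but not an API:

    /* Assumed placeholders: the slides only name these operations. */
    double run_and_measure(struct rule *table, size_t nrules);
    void   mutate_patterns(struct rule *table, size_t nrules);
    void   mutate_actions(struct rule *table, size_t nrules);

    /* Assumed driver: evaluate the current table, remember the best
     * score, then mutate conditions (patterns) and actions. */
    void learn(struct rule *table, size_t nrules, int generations)
    {
        double best = 0.0;
        for (int g = 0; g < generations; g++) {
            double perf = run_and_measure(table, nrules);
            if (perf > best)
                best = perf;                  /* keep the best table */
            mutate_patterns(table, nrules);   /* adjust matched ranges */
            mutate_actions(table, nrules);    /* try other partitions */
        }
    }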

Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results

Experimental Results

• Experiments on three platforms
  – Sun UltraSparcIII
  – P4 Intel Xeon
  – Intel Itanium2

• Matrices of sizes from 1000 × 1000 to 5000 × 5000

Algorithms

• Classifier MMM: our approach
  – Includes the overhead of copying in and out of the recursive layout

• ATLAS: library generated by ATLAS using the search procedure, without hand-written code
  – Has some form of blocking for L2

• L1: one level of tiling
  – Tile size: the same as ATLAS uses for L1

• L2: two levels of tiling
  – L1 tile and L2 tile: the same as ATLAS uses for L1

Conclusion and Future Work

• Preliminary results show the effectiveness of our approach
  – Sun UltraSparcIII and Xeon: 18% and 5% improvement, respectively
  – Itanium: -14%

• Need to improve the padding mechanism
  – Reduce the amount of padding
  – Avoid unnecessary computation on the padding

Thank you!
