Optimizing Matrix Multiplication with a Classifier Learning System Xiaoming Li (presenter) María Jesús Garzarán University of Illinois at Urbana-Champaign


Optimizing Matrix Multiplication with a Classifier Learning System

Xiaoming Li (presenter), María Jesús Garzarán

University of Illinois at Urbana-Champaign


Tuning library for recursive matrix multiplication

• Use cache-aware algorithms that take into account architectural features
  – Memory hierarchy
  – Register file, …

• Take into account input characteristics
  – Matrix sizes

• The process of tuning is automatic.


Recursive Matrix Partitioning

• Previous approaches
  – Multiple recursive steps
  – Only divide by half

[Figure: matrices A and B, halved in successive steps (Step 1, Step 2)]


Recursive Matrix Partitioning

• Our approach is more general
  – No need to divide by half
  – May use a single step to reach the same partition
  – Faster and more general

[Figure: matrices A and B, partitioned in a single step (Step 1)]


Our approach

• A general framework to describe a family of recursive matrix multiplication algorithms where, given the input dimensions of the matrices, we determine:
  – Number of partition levels
  – How to partition at each level

• An intelligent search method based on a classifier learning system
  – Search for the best partitioning strategy in a huge search space


Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results


Recursive layout framework

• Multiple levels of recursion
  – Takes into account the cache hierarchy

 1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64


Recursive layout framework

• Multiple levels of recursion
  – Takes into account the cache hierarchy

 1  2  5  6 17 18 21 22
 3  4  7  8 19 20 23 24
 9 10 13 14 25 26 29 30
11 12 15 16 27 28 31 32
33 34 37 38 49 50 53 54
35 36 39 40 51 52 55 56
41 42 45 46 57 58 61 62
43 44 47 48 59 60 63 64
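The 8 x 8 grid of storage positions can be reproduced with a short sketch. It assumes each recursion level splits the current tile by a pair of partition factors and that the innermost tiles are stored row by row. The function name `recursive_layout` and its parameters are hypothetical, chosen for illustration, and are not taken from the slides.

```python
def recursive_layout(rows, cols, factors):
    """Return a rows x cols grid whose entries are 1-based storage positions.

    factors is a list of (row_factor, col_factor) pairs, one per recursion
    level. Tiles at the innermost level are stored row by row.
    This is an illustrative sketch, not the authors' implementation.
    """
    layout = [[0] * cols for _ in range(rows)]
    counter = 0

    def place(r0, c0, height, width, level):
        nonlocal counter
        if level == len(factors):
            # Innermost tile: assign consecutive positions in row-major order.
            for r in range(r0, r0 + height):
                for c in range(c0, c0 + width):
                    counter += 1
                    layout[r][c] = counter
            return
        pr, pc = factors[level]
        if height % pr or width % pc:
            raise ValueError("dimension not divisible; pad the matrix first")
        tile_h, tile_w = height // pr, width // pc
        # Visit the sub-tiles row by row and recurse into each one.
        for i in range(pr):
            for j in range(pc):
                place(r0 + i * tile_h, c0 + j * tile_w, tile_h, tile_w, level + 1)

    place(0, 0, rows, cols, 0)
    return layout


if __name__ == "__main__":
    for row in recursive_layout(8, 8, [(2, 2), (2, 2)]):
        print(" ".join(f"{v:2d}" for v in row))
```

With an empty factor list the function returns the plain row-major numbering, and with two levels of (2, 2) it returns the blocked numbering shown on the slide. Factors other than 2 work the same way as long as they divide the tile dimensions.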


Padding

• Necessary when the partition factor is not a divisor of the matrix dimension.

[Figure: a dimension of 2000 divided by 3 is padded to 2001, giving tiles of 667; dividing those tiles by 4 pads each tile to 668 and the full dimension to 2004.]
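The padding arithmetic in this example can be sketched as follows, assuming padding rounds the dimension up to the next multiple of the product of the partition factors so that every level divides evenly. The helper names `padded_dimension` and `tile_sizes` are hypothetical and not from the slides.

```python
import math


def padded_dimension(n, factors):
    """Smallest size >= n that divides evenly at every partition level.

    Assumption: the padded size must be a multiple of the product of the
    per-level partition factors.
    """
    block = math.prod(factors)
    return (n + block - 1) // block * block


def tile_sizes(n, factors):
    """Tile size along one dimension after each partition level."""
    size = padded_dimension(n, factors)
    sizes = []
    for p in factors:
        size //= p
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(padded_dimension(2000, [3]))     # one level, divide by 3
    print(padded_dimension(2000, [3, 4]))  # two levels, divide by 3 then 4
    print(tile_sizes(2000, [3, 4]))
```

Under this assumption a second partition level can increase the padding chosen for the first level, which is the effect the 2001 to 2004 step illustrates.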


Recursive layout in our framework

• Multiple-level recursion
  – Supports the cache hierarchy

• Square tile → rectangular tile
  – Fits non-square matrices

[Figure: non-square matrix example with dimension labels 9 and 8, then 10 and 8 with padding, then 3 and 4]


Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results


Two methods to partition matrices

• Partition by Block (PB)
  – Specify the size of each tile
  – Example:
    • Dimensions (M, N, K) = (100, 100, 40)
    • Tile size (bm, bn, bk) = (50, 50, 20)
    • Partition factors (pm, pn, pk) = (2, 2, 2)
  – Partition factors: $p_m = \frac{M}{b_m},\quad p_n = \frac{N}{b_n},\quad p_k = \frac{K}{b_k}$
  – Tiles need not be square
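The Partition by Block example can be checked with a minimal sketch. Rounding up when a tile size does not divide the dimension is an assumption made here (it corresponds to padding the last tile); the function name `partition_by_block` is hypothetical.

```python
def partition_by_block(dims, tile):
    """Partition factors (pm, pn, pk) for dimensions (M, N, K) and a tile size.

    Assumption: when a tile size does not divide a dimension, the factor is
    rounded up, which implies the last tile is padded.
    """
    return tuple(-(-d // b) for d, b in zip(dims, tile))


if __name__ == "__main__":
    print(partition_by_block((100, 100, 40), (50, 50, 20)))
```

Because the tile size is given per dimension, the same call handles rectangular tiles without any special case.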


Two methods to partition matrices

• Partition by Size (PS)
  – Specify the maximum size of the three tiles.
  – Maintain the ratios between dimensions constant
  – Example:
    • (M, N, K) = (100, 100, 50)
    • Maximum tile size for M, N = 1250
    • (pm, pn, pk) = (2, 2, 1)
  – Generalization of the “divide-by-half” approach.
    • Tile size = 1/4 * matrix size


Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results


Classifier Learning System

• Use the two partition primitives to determine how the input matrices are partitioned
  – Determine partition factors at each level

$f: (M, N, K) \rightarrow (p_{m_i}, p_{n_i}, p_{k_i}),\quad i = 0, 1, 2$ (only 3 levels are considered)

• The partition factors depend on the matrix size
  – E.g., the partition factors of a (1000 x 1000) matrix should be different from those of a (50 x 1000) matrix.

• The partition factors also depend on architectural characteristics, such as cache size.
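To make the output of f concrete, the sketch below applies one set of partition factors per level and reports the tile dimensions that result. The factor values in the example are hypothetical and are not tuned results from the talk; the function name `tiles_per_level` is also an illustration only.

```python
def tiles_per_level(dims, factors_per_level):
    """Tile dimensions (M, N, K) after each partition level.

    factors_per_level holds one (pm, pn, pk) triple per level.
    Assumption: sizes are rounded up, i.e. tiles are padded when a factor
    does not divide the current dimension.
    """
    sizes = []
    current = tuple(dims)
    for factors in factors_per_level:
        current = tuple(-(-d // p) for d, p in zip(current, factors))
        sizes.append(current)
    return sizes


if __name__ == "__main__":
    # Hypothetical factors for a square and a narrow problem.
    print(tiles_per_level((1000, 1000, 1000), [(2, 2, 2), (5, 5, 5), (4, 4, 4)]))
    print(tiles_per_level((50, 1000, 1000), [(1, 2, 2), (1, 5, 5), (2, 4, 4)]))
```

In this made-up example both problems end with 25 x 25 x 25 tiles, but the narrow matrix needs different factors along M to get there, which is the point of the second bullet.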


Determine the best partition factors

• The search space is huge → exhaustive search is impossible

• Our proposal: use a multi-step classifier learning system
  – Creates a table that, given the matrix dimensions, determines the partition factors


Classifier Learning System

• The result of the classifier learning system is a table with two columns

• Column 1 (Pattern): A string of ‘0’, ‘1’, and ‘*’ that encodes the dimensions of the matrices

• Column 2 (Action): Partition method for one step
  – Built using the “partition-by-block” and “partition-by-size” primitives with different parameters.


Learn with Classifier System

Pattern          Action
(10***,11***)    PS 100
…                …
(010**,011**)    PB (4,4)

5 bits / dim

[Figure: example matrix dimensions at successive steps: 16 and 24, then 8 and 12, then 4 and 4]
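The table lookup can be sketched under one assumption: each dimension is encoded as a plain 5-bit binary string, which is consistent with the slide's dimensions (16 is 10000, 24 is 11000, 8 is 01000, 12 is 01100). Only the two rules visible in the table are included; the elided rows are left out. The function names are hypothetical.

```python
# The two rules shown in the table; the elided rows are not reproduced.
RULES = [
    (("10***", "11***"), "PS 100"),
    (("010**", "011**"), "PB (4,4)"),
]


def encode(dim, bits=5):
    """Encode a dimension as a fixed-width binary string (assumed encoding)."""
    if not 0 <= dim < 2 ** bits:
        raise ValueError("dimension does not fit in the assumed bit width")
    return format(dim, f"0{bits}b")


def matches(pattern, code):
    """A pattern matches when every position is '*' or equals the code bit."""
    return len(pattern) == len(code) and all(
        p in ("*", c) for p, c in zip(pattern, code)
    )


def lookup(rows, cols, rules=RULES):
    """Return the action of the first rule matching both dimensions, or None."""
    codes = (encode(rows), encode(cols))
    for patterns, action in rules:
        if all(matches(p, c) for p, c in zip(patterns, codes)):
            return action
    return None


if __name__ == "__main__":
    for shape in [(16, 24), (8, 12), (4, 4)]:
        print(shape, lookup(*shape))
```

Read this way, a 16 x 24 matrix selects PS 100, the resulting 8 x 12 tile selects PB (4,4), and the final 4 x 4 tile matches no rule shown, so partitioning stops.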


How does the classifier learning algorithm work?

• Change the table based on the feedback of performance and accuracy from previous runs.

• Mutate the condition part of the table to adjust the range of matching matrix dimensions.

• Mutate the action part to find the best partition method for the matching matrices.
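The two mutation steps can be illustrated with a generic sketch in the style of classifier systems. It is not the authors' operator: the single-position mutation, the candidate action list (including "PS 200"), and the function names are all assumptions, and the performance and accuracy feedback that decides which rules survive is not modeled.

```python
import random

ALPHABET = "01*"


def mutate_pattern(pattern, rng):
    """Change one position of the condition to a different symbol.

    Turning a bit into '*' widens the range of matching dimensions;
    turning '*' into a bit narrows it.
    """
    i = rng.randrange(len(pattern))
    replacement = rng.choice([c for c in ALPHABET if c != pattern[i]])
    return pattern[:i] + replacement + pattern[i + 1:]


def mutate_action(action, candidates, rng):
    """Swap the action for a different candidate partition method.

    candidates must contain at least one action other than the current one.
    """
    return rng.choice([a for a in candidates if a != action])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    candidates = ["PS 100", "PS 200", "PB (4,4)"]  # hypothetical action pool
    pattern, action = "10***", "PS 100"
    print(mutate_pattern(pattern, rng), mutate_action(action, candidates, rng))
```

In a full system each mutated rule would be timed on matching matrices and kept or discarded based on that feedback.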


Outline

• Background

• Partition Methods

• Classifier Learning System

• Experimental Results


Experimental Results

• Experiments on three platforms
  – Sun UltraSparcIII
  – P4 Intel Xeon
  – Intel Itanium2

• Matrices of sizes from 1000 x 1000 to 5000 x 5000


Algorithms

• Classifier MMM: our approach
  – Includes the overhead of copying in and out of the recursive layout

• ATLAS: library generated by ATLAS using its search procedure, without hand-written code
  – Has some type of blocking for L2

• L1: one level of tiling
  – Tile size: the same as ATLAS uses for L1

• L2: two levels of tiling
  – L1tile and L2tile: the same as ATLAS uses for L1


Conclusion and Future Work

• Preliminary results show that the approach is effective on two of the three platforms
  – Sun UltraSparcIII and Xeon: 18% and 5% improvement, respectively.
  – Itanium: -14%

• Need to improve the padding mechanism
  – Reduce the amount of padding
  – Avoid unnecessary computation on padding


Thank you!