Optimizing Matrix Multiplication with a Classifier Learning System
Xiaoming Li (presenter), María Jesús Garzarán
University of Illinois at Urbana-Champaign
Tuning library for recursive matrix multiplication
• Use cache-aware algorithms that take into account architectural features
– Memory hierarchy
– Register file, …
• Take into account input characteristics
– Matrix sizes
• The process of tuning is automatic.
Recursive Matrix Partitioning
• Previous approaches:
– Multiple recursive steps
– Only divide by half
[Figure: matrices A and B, halved again at each step (Step 1, Step 2)]
Recursive Matrix Partitioning
• Our approach is more general
– No need to divide by half
– May use a single step to reach the same partition
– Faster and more general
[Figure: matrices A and B, partitioned in a single step (Step 1)]
Our approach
• A general framework to describe a family of recursive matrix multiplication algorithms, where, given the input dimensions of the matrices, we determine:
– Number of partition levels
– How to partition at each level
• An intelligent search method based on a classifier learning system
– Search for the best partitioning strategy in a huge search space
Outline
• Background
• Partition Methods
• Classifier Learning System
• Experimental Results
Recursive layout framework
• Multiple levels of recursion
– Takes into account the cache hierarchy
Row-major layout (each entry is the element's position in memory):
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64
Recursive layout framework
• Multiple levels of recursion
– Takes into account the cache hierarchy
Recursive layout (each entry is the element's position in memory):
1 2 5 6 17 18 21 22
3 4 7 8 19 20 23 24
9 10 13 14 25 26 29 30
11 12 15 16 27 28 31 32
33 34 37 38 49 50 53 54
35 36 39 40 51 52 55 56
41 42 45 46 57 58 61 62
43 44 47 48 59 60 63 64
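The recursive numbering on this slide can be reproduced with a short sketch. It assumes the 8 x 8 matrix is split 2 x 2 at each of three levels, so the innermost tiles are single elements; the function name `recursive_position` and the `factors` argument are illustrative and do not come from the slides.

```python
def recursive_position(row, col, factors):
    """0-based memory position of element (row, col) in a recursive layout.

    `factors` lists the (row, column) partition factors per level, outermost
    first. Assumption: the matrix has prod(row factors) rows and
    prod(column factors) columns, so the innermost tiles are single elements.
    """
    # Start from the full matrix size and shrink it level by level.
    tile_rows, tile_cols = 1, 1
    for pr, pc in factors:
        tile_rows *= pr
        tile_cols *= pc

    position = 0
    for pr, pc in factors:
        tile_rows //= pr  # tile height at this level
        tile_cols //= pc  # tile width at this level
        tile_r, row = divmod(row, tile_rows)
        tile_c, col = divmod(col, tile_cols)
        # Tiles inside one level are visited in row-major order.
        position = position * (pr * pc) + tile_r * pc + tile_c
    return position


levels = [(2, 2)] * 3  # three levels, each dividing both dimensions by 2
grid = [[recursive_position(r, c, levels) + 1 for c in range(8)]
        for r in range(8)]
for line in grid:
    print(" ".join(f"{value:2d}" for value in line))
```

Passing other factors, such as `[(4, 4), (2, 2)]`, gives a layout that reaches 2 x 2 tiles in a single step at the outer level instead of two halving steps.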
Padding
• Necessary when the partition factor is not a divisor of the matrix dimension.
[Figure sequence: a dimension of 2000 divided by 3 is padded to 2001 (tiles of 667); dividing by 4 at the next level pads each tile to 668, so the dimension becomes 2004]
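The numbers in the padding example (2000, 2001, 667, 2004, 668) are consistent with rounding the tile size up at each level and multiplying back. The sketch below shows that arithmetic; the helper name `pad_dimension` is hypothetical, and the slides do not say how the library actually computes padding.

```python
def pad_dimension(n, factors):
    """Return (padded size, innermost tile size) for one matrix dimension.

    `factors` holds the partition factor of each level, outermost first.
    Assumption: every level rounds its tile size up, and the padded
    dimension is the innermost tile size times all the factors.
    """
    tile = n
    for p in factors:
        tile = -(-tile // p)  # ceiling division without floats
    padded = tile
    for p in factors:
        padded *= p
    return padded, tile


print(pad_dimension(2000, [3]))     # one level: divide by 3
print(pad_dimension(2000, [3, 4]))  # two levels: divide by 3, then by 4
```

With two levels the first-level tile is the padded size divided by 3, which is 668 for the second call.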
Recursive layout in our framework
• Multiple-level recursion
– Support cache hierarchy
• Square tile → rectangular tile
– Fit non-square matrices
[Figure labels: 9, 8; 10, 8 (with padding); 3, 4]
Outline
• Background
• Partition Methods
• Classifier Learning System
• Experimental Results
Two methods to partition matrices
• Partition by Block (PB)
– Specify the size of each tile
– Example:
  • Dimensions (M,N,K) = (100, 100, 40)
  • Tile size (bm, bn, bk) = (50, 50, 20)
  Partition factors (pm, pn, pk) = (2,2,2)
– Tiles need not be square
– Partition factors: pm = m / bm, pn = n / bn, pk = k / bk
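As a minimal sketch of partition by block, each partition factor is the matrix dimension divided by the tile dimension. The function name `partition_by_block` is illustrative, and rounding up when the tile size does not divide the dimension is an assumption based on the padding slides.

```python
def partition_by_block(dims, tile):
    """Partition factors (pm, pn, pk) for dimensions (M, N, K) and a tile size.

    Assumption: when a tile dimension does not divide the matrix dimension,
    the factor is rounded up and the last tile is padded.
    """
    return tuple(-(-d // b) for d, b in zip(dims, tile))  # ceiling division


# Example from the slide: (M, N, K) = (100, 100, 40), tile (50, 50, 20).
print(partition_by_block((100, 100, 40), (50, 50, 20)))
```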
Two methods to partition matrices
• Partition by Size (PS)
– Specify the maximum size of the three tiles.
– Keep the ratios between dimensions constant
– Example:
  • (M,N,K) = (100, 100, 50)
  • Maximum tile size for M,N = 1250
  (pm, pn, pk) = (2,2,1)
– Generalization of the “divide-by-half” approach.
  • Tile size = 1/4 * matrix size
Outline
• Background
• Partition Methods
• Classifier Learning System
• Experimental Results
Classifier Learning System
• Use the two partition primitives to determine how the input matrices are partitioned
– Determine partition factors at each level
f: (M,N,K) → (pmi,pni,pki), i=0,1,2 (only consider 3 levels)
• The partition factors depend on the matrix size
– E.g., the partition factors of a (1000 x 1000) matrix should be different from those of a (50 x 1000) matrix.
• The partition factors also depend on the architectural characteristics, like cache size.
Determine the best partition factors
• The search space is huge → exhaustive search is impossible
• Our proposal: use a multi-step classifier learning system
– Creates a table that, given the matrix dimensions, determines the partition factors
Classifier Learning System
• The result of the classifier learning system is a table with two columns
• Column 1 (Pattern): A string of ‘0’, ‘1’, and ‘*’ that encodes the dimensions of the matrices
• Column 2 (Action): Partition method for one step
– Built using the “partition-by-block” and “partition-by-size” primitives with different parameters.
Learn with Classifier System
Pattern Action
(10***,11***) PS 100
… …
(010**,011**) PB (4,4)
[Figure sequence: each dimension is encoded with 5 bits; dimensions (16, 24) match the first pattern and become (8, 12), which match the last pattern and become (4, 4)]
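The table lookup can be sketched as a wildcard match on bit strings. The sketch assumes each dimension is written directly as a 5-bit binary number, which agrees with the values 16, 24, 8 and 12 in the example; how larger dimensions are encoded is not stated in the slides. The names `encode`, `matches` and `lookup` are illustrative, and only the two table rows shown on the slide are included.

```python
BITS = 5  # "5 bits / dim" on the slide


def encode(dim, bits=BITS):
    """Binary string for one dimension (assumed plain binary encoding)."""
    if not 0 <= dim < 2 ** bits:
        raise ValueError(f"{dim} does not fit in {bits} bits")
    return format(dim, f"0{bits}b")


def matches(pattern, code):
    """True if every pattern symbol is '*' or equals the corresponding bit."""
    return len(pattern) == len(code) and all(
        p in ("*", c) for p, c in zip(pattern, code))


# Only the two rows visible on the slide; the elided rows are not guessed.
TABLE = [
    (("10***", "11***"), "PS 100"),
    (("010**", "011**"), "PB (4,4)"),
]


def lookup(dims, table=TABLE):
    """Return the action of the first row whose patterns match all dimensions."""
    codes = [encode(d) for d in dims]
    for patterns, action in table:
        if all(matches(p, c) for p, c in zip(patterns, codes)):
            return action
    return None


print(lookup((16, 24)))  # 10000, 11000 match the first row
print(lookup((8, 12)))   # 01000, 01100 match the last row
```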
How the classifier learning algorithm works
• Change the table based on the feedback of performance and accuracy from previous runs.
• Mutate the condition part of the table to adjust the range of matching matrix dimensions.
• Mutate the action part to find the best partition method for the matching matrices.
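One way to picture mutating the condition part is to change a single symbol of a pattern, which widens or narrows the range of dimensions it matches. The slides do not specify the mutation operator, so `mutate_pattern` below is a hypothetical single-symbol mutation, not the authors' procedure.

```python
import random

ALPHABET = "01*"  # symbols allowed in a pattern


def mutate_pattern(pattern, rng):
    """Return a copy of `pattern` with one randomly chosen symbol changed.

    Hypothetical operator: replacing a bit with '*' makes the pattern match
    more dimensions, and replacing '*' with a bit makes it match fewer.
    """
    i = rng.randrange(len(pattern))
    choices = [s for s in ALPHABET if s != pattern[i]]
    return pattern[:i] + rng.choice(choices) + pattern[i + 1:]


rng = random.Random(0)  # fixed seed so the run is repeatable
mutated = mutate_pattern("10***", rng)
print("10***", "->", mutated)
```

In a full learning loop, a mutated row would be kept or discarded according to the measured performance of the matrices it matches.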
Outline
• Background
• Partition Methods
• Classifier Learning System
• Experimental Results
Experimental Results
• Experiments on three platforms
– Sun UltraSparcIII
– P4 Intel Xeon
– Intel Itanium2
• Matrices of sizes from 1000 x 1000 to 5000 x 5000
Algorithms
• Classifier MMM: our approach
– Includes the overhead of copying in and out of the recursive layout
• ATLAS: library generated by ATLAS using its search procedure, without hand-written code
– Has some type of blocking for L2
• L1: one level of tiling
– Tile size: the same as ATLAS uses for L1
• L2: two levels of tiling
– L1tile and L2tile: the same as ATLAS uses for L1
Conclusion and Future Work
• Preliminary results prove the effectiveness of our approach
– Sun UltraSparcIII and Xeon: 18% and 5% improvement, respectively.
– Itanium: -14%
• Need to improve the padding mechanism
– Reduce the amount of padding
– Avoid unnecessary computation on padding
Thank you!