View
217
Download
0
Category
Preview:
Citation preview
Introduction PetaBricks OpenTuner Conclusions
Autotuning Programs with Algorithmic Choice
Jason Ansel
MIT - CSAIL
December 18, 2013
Introduction PetaBricks OpenTuner Conclusions
High Performance Search Problem
Performance search space:
Para
llelis
m c
ho
ices
Accuracy choices
A
lgorithmic choices
• Parallelism
• Exploiting parallelism isnecessary but not sufficient
• Performance is amulti-dimensional search problem
• Normally done by expertprogrammers
• Optimization decisions oftenchange program results
Introduction PetaBricks OpenTuner Conclusions
High Performance Search Problem
Performance search space:
Para
llelis
m c
ho
ices
Accuracy choices
A
lgorithmic choices
• Parallelism Performance• Exploiting parallelism is
necessary but not sufficient
• Performance is amulti-dimensional search problem
• Normally done by expertprogrammers
• Optimization decisions oftenchange program results
Introduction PetaBricks OpenTuner Conclusions
High Performance Search Problem
Performance search space:
Para
llelis
m c
ho
ices
Accuracy choices
A
lgorithmic choices
• Parallelism Performance• Exploiting parallelism is
necessary but not sufficient
• Performance is amulti-dimensional search problem
• Normally done by expertprogrammers
• Optimization decisions oftenchange program results
Introduction PetaBricks OpenTuner Conclusions
High Performance Search Problem
Goal of this workTo automate the process of program optimization to createprograms that can adapt to changing environments and goals.
• Language level solutions for concisely representing algorithmicchoice spaces.
• Processes and compilation techniques to manage and explorethese spaces.
• Autotuning techniques to efficiently solve these searchproblems.
Introduction PetaBricks OpenTuner Conclusions
High Performance Search Problem
Goal of this workTo automate the process of program optimization to createprograms that can adapt to changing environments and goals.
• Language level solutions for concisely representing algorithmicchoice spaces.
• Processes and compilation techniques to manage and explorethese spaces.
• Autotuning techniques to efficiently solve these searchproblems.
Introduction PetaBricks OpenTuner Conclusions
Research Covered in This Talk
• The PetaBricks programming language: algorithmic choice atthe language level [PLDI’09]
• Language level support for variable accuracy [CGO’11]
• Automated construction of multigrid V-cycles [SC’09]
• Code generation and autotuning for heterogeneous CPU/GPUmix of parallel processing units [ASPLOS’13]
• Solution for input sensitivity based on adaptiveoverhead-aware classifiers [Under review]
• OpenTuner: an extensible framework for program autotuning[Under review]
• Won’t be talking about work in: ASPLOS’09, ASPLOS’12,GECCO’11, IPDPS’09, PLDI’11, and many others
Introduction PetaBricks OpenTuner Conclusions
Research Covered in This Talk
• The PetaBricks programming language: algorithmic choice atthe language level [PLDI’09]
• Language level support for variable accuracy [CGO’11]
• Automated construction of multigrid V-cycles [SC’09]
• Code generation and autotuning for heterogeneous CPU/GPUmix of parallel processing units [ASPLOS’13]
• Solution for input sensitivity based on adaptiveoverhead-aware classifiers [Under review]
• OpenTuner: an extensible framework for program autotuning[Under review]
• Won’t be talking about work in: ASPLOS’09, ASPLOS’12,GECCO’11, IPDPS’09, PLDI’11, and many others
Introduction PetaBricks OpenTuner Conclusions
A Motivating Example for Algorithmic Choice
• How would you write a fast sorting algorithm?
• Insertion sort• Quick sort• Merge sort• Radix sort
• Poly-algorithms
Introduction PetaBricks OpenTuner Conclusions
A Motivating Example for Algorithmic Choice
• How would you write a fast sorting algorithm?• Insertion sort• Quick sort• Merge sort• Radix sort
• Poly-algorithms
Introduction PetaBricks OpenTuner Conclusions
A Motivating Example for Algorithmic Choice
• How would you write a fast sorting algorithm?• Insertion sort• Quick sort• Merge sort• Radix sort
• Poly-algorithms
Introduction PetaBricks OpenTuner Conclusions
std::stable sort
/usr/include/c++/4.5.2/bits/stl algo.h lines 3350-3367
Introduction PetaBricks OpenTuner Conclusions
std::stable sort
/usr/include/c++/4.5.2/bits/stl algo.h lines 3350-3367
Introduction PetaBricks OpenTuner Conclusions
Why 15?
• Why 15?
• Dates back to at least 2000 (June 2000 SGI release)
• Still in current C++ STL shipped with GCC
• cutoff = 15 survived 10+ years
• In the source code for millions of C++ programs
• There is nothing the compiler can do about it
Introduction PetaBricks OpenTuner Conclusions
Why 15?
• Why 15?
• Dates back to at least 2000 (June 2000 SGI release)
• Still in current C++ STL shipped with GCC
• cutoff = 15 survived 10+ years
• In the source code for millions1 of C++ programs
• There is nothing the compiler can do about it
1Any C++ program with “#include <algorithm>”, conservative estimate based on:
http://c2.com/cgi/wiki?ProgrammingLanguageUsageStatistics
Introduction PetaBricks OpenTuner Conclusions
Is 15 The Right Number?
• The best cutoff (CO) changes
• Depends on competing costs:• Cost of computation (< operator, call overhead, etc)• Cost of communication (swaps)• Cache behavior (misses, prefetcher, locality)
• Sorting 100000 doubles with std::stable sort:• CO ≈ 200 optimal on a Phenom 905e (15% speedup)• CO ≈ 400 optimal on a Opteron 6168 (15% speedup)• CO ≈ 500 optimal on a Xeon E5320 (34% speedup)• CO ≈ 700 optimal on a Xeon X5460 (25% speedup)
• If the best cutoff has changed, perhaps best algorithm hasalso changed
Introduction PetaBricks OpenTuner Conclusions
Is 15 The Right Number?
• The best cutoff (CO) changes
• Depends on competing costs:• Cost of computation (< operator, call overhead, etc)• Cost of communication (swaps)• Cache behavior (misses, prefetcher, locality)
• Sorting 100000 doubles with std::stable sort:• CO ≈ 200 optimal on a Phenom 905e (15% speedup)• CO ≈ 400 optimal on a Opteron 6168 (15% speedup)• CO ≈ 500 optimal on a Xeon E5320 (34% speedup)• CO ≈ 700 optimal on a Xeon X5460 (25% speedup)
• If the best cutoff has changed, perhaps best algorithm hasalso changed
Introduction PetaBricks OpenTuner Conclusions
Algorithmic Choice
• Compiler’s hands are tied, it is stuck with 15
• Need a better way to represent algorithmic choices
• PetaBricks is the first language with support for algorithmicchoice
Introduction PetaBricks OpenTuner Conclusions
Sort in PetaBricks
Language
funct ion S o r tto out [ n ]from i n [ n ]{
e i t h e r {I n s e r t i o n S o r t ( out , i n ) ;
} or {Q u i c k S o r t ( out , i n ) ;
} or {MergeSort ( out , i n ) ;
} or {R a d i x S o r t ( out , i n ) ;
}}
⇒Representation
Decision treesynthesized by ourautotuner
Introduction PetaBricks OpenTuner Conclusions
Sort in PetaBricks
Language
funct ion S o r tto out [ n ]from i n [ n ]{
e i t h e r {I n s e r t i o n S o r t ( out , i n ) ;
} or {Q u i c k S o r t ( out , i n ) ;
} or {MergeSort ( out , i n ) ;
} or {R a d i x S o r t ( out , i n ) ;
}}
⇒Representation
Decision treesynthesized by ourautotuner
Introduction PetaBricks OpenTuner Conclusions
Decision Trees
Optimized for a Xeon E7340 (8 cores):
N < 600
N < 1420Insertion Sort
Quick Sort Merge Sort(2-way)
Introduction PetaBricks OpenTuner Conclusions
Decision Trees
Optimized for Sun Fire T200 Niagara (8 cores):
N < 1461
N < 2400
Merge Sort(4-way)
Merge Sort(2-way)
N < 75
Merge Sort(8-way)
Merge Sort(16-way)
Introduction PetaBricks OpenTuner Conclusions
Sort Algorithm Timings2
0
0.0005
0.001
0.0015
0.002
0.0025
0 250 500 750 1000 1250 1500 1750
Tim
e (s
)
Input Size
InsertionSortQuickSortMergeSortRadixSortAutotuned
2On an 8-way Xeon E7340 system
Introduction PetaBricks OpenTuner Conclusions
Iteration Order Choices
• Many other choices related to execution order• By rows?• By columns?• Diagonal? Reverse order? Blocked?• Parallel?
• Choices both within a single (possibly parallel)task and between different tasks
• This is main motivation for a new language asopposed to a library
Introduction PetaBricks OpenTuner Conclusions
Iteration Order Choices
• Many other choices related to execution order• By rows?• By columns?• Diagonal? Reverse order? Blocked?• Parallel?
• Choices both within a single (possibly parallel)task and between different tasks
• This is main motivation for a new language asopposed to a library
Introduction PetaBricks OpenTuner Conclusions
Synthesized Outer Control Flow
• PetaBricks programs have synthesized outer control flow• Declarative (data flow like) outer syntax• Imperative inner code
• Programs start as completely parallel
• Added dependencies restrict the space of legal executions
• May only access data explicitly depended on
Parallel loop
X . c e l l ( i ) from ( ) { . . . }
Sequential loop
X . c e l l ( i ) from (X . c e l l ( i −1) l e f t ) { . . . }
Introduction PetaBricks OpenTuner Conclusions
Matrix Multiply
transform M a t r i x M u l t i p l yto AB[ w, h ]from A [ c , h ] , B [ w, c ]{
AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b ){return dot ( a , b ) ;
}
}
to (AB. reg ion ( x , y , x + 4 , y + 4) out )from (A . reg ion ( 0 , y , c , y + 4) a ,
B . reg ion ( x , 0 , x + 4 , c ) b ){// . . . compute 4 x 4 b l o c k . . .
}}
Introduction PetaBricks OpenTuner Conclusions
Matrix Multiply
transform M a t r i x M u l t i p l yto AB[ w, h ]from A [ c , h ] , B [ w, c ]{
AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b ){return dot ( a , b ) ;
}
to (AB. reg ion ( x , y , x + 4 , y + 4) out )from (A . reg ion ( 0 , y , c , y + 4) a ,
B . reg ion ( x , 0 , x + 4 , c ) b ){// . . . compute 4 x 4 b l o c k . . .
}}
Introduction PetaBricks OpenTuner Conclusions
Strassen Matrix Multiply
t r a n s f o r m S t r a s s e nto AB[ n , n ]from A[ n , n ] , B [ n , n ]u s i n g M1[ n/2 , n /2 ] , M2[ n /2 , n /2 ] , M3[ n /2 , n /2 ] , M4[ n /2 , n /2 ] ,
M5[ n /2 , n /2 ] , M6[ n /2 , n /2 ] , M7[ n /2 , n /2 ]{
to (M1 m1)from (A . r e g i o n (0 , 0 , n /2 , n /2) a11 ,
A . r e g i o n ( n /2 , n /2 , n , n ) a22 ,B . r e g i o n (0 , 0 , n /2 , n /2) b11 ,B . r e g i o n ( n /2 , n /2 , n , n ) b22 )
u s i n g ( t1 [ n / 2 , n / 2 ] , t2 [ n /2 , n / 2 ] ) {spawn MatrixAdd ( t1 , a11 , a22 ) ;spawn MatrixAdd ( t2 , b11 , b22 ) ;sync ;S t r a s s e n (m1, t1 , t2 ) ;
}. . . .// Compute one quadrant o f output w i th s t r a s s e n decompos i t i onto (AB. r e g i o n ( n /2 , 0 , n , n /2) c12 ) from (M3 m3, M5 m5){
MatrixAdd ( c12 , m3, m5 ) ;}. . . .// Or , compute e l ement i n output d i r e c t l y ( same as l a s t s l i d e )AB. c e l l ( x , y ) from (A . row ( y ) a , B . column ( x ) b){
r e t u r n dot ( a , b ) ;}
}
Introduction PetaBricks OpenTuner Conclusions
Variable Accuracy Algorithms
• Many problems don’t have a single correct answer,optimizations often trade-off accuracy and performance.
• Soft computing• DSP algorithms• Iterative algorithms
• Variable accuracy, supported in the PetaBricks language, is afundamental part of algorithmic choice which enables newclasses of programs to be represented.
Introduction PetaBricks OpenTuner Conclusions
Variable Accuracy Algorithms
• Many problems don’t have a single correct answer,optimizations often trade-off accuracy and performance.
• Soft computing• DSP algorithms• Iterative algorithms
• Variable accuracy, supported in the PetaBricks language, is afundamental part of algorithmic choice which enables newclasses of programs to be represented.
Introduction PetaBricks OpenTuner Conclusions
K-Means Examplet r a n s f o r m kmeansfrom Po in t s [ n , 2 ] // Array o f p o i n t s ( each column
// s t o r e s x and y c o o r d i n a t e s )u s i n g Cen t r o i d s [ s q r t ( n ) , 2 ]to Ass ignments [ n ]{
// Rule 1 :// One p o s s i b l e i n i t i a l c o n d i t i o n : Random// s e t o f p o i n t sto ( C en t r o i d s . column ( i ) c ) from ( Po i n t s p ) {
c=p . column ( rand (0 , n ) )}
// Rule 2 :// Another i n i t i a l c o n d i t i o n : C en t e r p l u s i n i t i a l// c e n t e r s ( kmeans++)to ( C en t r o i d s c ) from ( Po i n t s p ) {
Cen t e rP l u s ( c , p ) ;}
// Rule 3 :// The kmeans i t e r a t i v e a l g o r i t hmto ( Ass ignments a ) from ( Po i n t s p , C en t r o i d s c ) {
w h i l e ( t r u e ) {i n t change ;A s s i g nC l u s t e r s ( a , change , p , c , a ) ;i f ( change==0) r e t u r n ; // Reached f i x e d po i n tNewC lu s t e rLoca t i on s ( c , p , a ) ;
}}
}
Introduction PetaBricks OpenTuner Conclusions
K-Means Example (Variable Accuracy)t r a n s f o r m kmeansa c c u r a c y m e t r i c kmeansaccuracya c c u r a c y v a r i a b l e kfrom Po in t s [ n , 2 ] // Array o f p o i n t s ( each column
// s t o r e s x and y c o o r d i n a t e s )u s i n g Cen t r o i d s [ k , 2 ]to Ass ignments [ n ]
...
// Rule 3 :// The kmeans i t e r a t i v e a l g o r i t hmto ( Ass ignments a ) from ( Po i n t s p , C en t r o i d s c ) {
f o r e n o u g h {i n t change ;A s s i g nC l u s t e r s ( a , change , p , c , a ) ;i f ( change==0) r e t u r n ; // Reached f i x e d po i n tNewC lu s t e rLoca t i on s ( c , p , a ) ;
}}
}t r a n s f o r m kmeansaccuracyfrom Ass ignments [ n ] , Po i n t s [ n , 2 ]to Accuracy{
Accuracy from ( Ass ignments a , Po i n t s p){r e t u r n s q r t (2∗n/ SumClus te rD i s tanceSquared ( a , p ) ) ;
}}
Introduction PetaBricks OpenTuner Conclusions
Semantics of Variable Accuracy
Running the accuracy metric on the output will return a valuethat, in expectation, exceeds the accuracy target more than Ppercent of the time.
• Expected distribution of accuracy measured during autotuningtime, not at runtime.
• When fixed accuracy code calls variable accuracy code, anaccuracy target must be specified.
• When variable accuracy code call code containing variableaccuracy components, only the outer most accuracy targetwill be honored.
Introduction PetaBricks OpenTuner Conclusions
Semantics of Variable Accuracy
Running the accuracy metric on the output will return a valuethat, in expectation, exceeds the accuracy target more than Ppercent of the time.
• Expected distribution of accuracy measured during autotuningtime, not at runtime.
• When fixed accuracy code calls variable accuracy code, anaccuracy target must be specified.
• When variable accuracy code call code containing variableaccuracy components, only the outer most accuracy targetwill be honored.
Introduction PetaBricks OpenTuner Conclusions
A Brief Multigrid Intro
• Used to iteratively solve PDEs over a gridded domain
• Relaxations update points using neighboring values (stencilcomputations)
• Restrictions and Interpolations compute new grid with coarseror finer discretization
Introduction PetaBricks OpenTuner Conclusions
Standard Cycle Shaps
• Cycle shapes effect accuracy andperformance
• Equation, accuracy target, data,and execution platform effectefficacy of different shapes
• Entire papers published about newcycle shapes!
• We fundamentally change thestatus quo in this domain
• Define the search space of cycleshapes once
• Autotune to find a cycle shapetailored to your problem
Introduction PetaBricks OpenTuner Conclusions
Standard Cycle Shaps
• Cycle shapes effect accuracy andperformance
• Equation, accuracy target, data,and execution platform effectefficacy of different shapes
• Entire papers published about newcycle shapes!
• We fundamentally change thestatus quo in this domain
• Define the search space of cycleshapes once
• Autotune to find a cycle shapetailored to your problem
Introduction PetaBricks OpenTuner Conclusions
Autotuned V-cycle Shapes
101
Gri
d S
ize
2048
1024
512
256
128
64
32
16
103
105
107
Gri
d S
ize
2048
1024
512
256
128
64
32
16
Introduction PetaBricks OpenTuner Conclusions
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
⇒ Grid size 2i
• Partition accuracy space into discrete levels
• Base space of candidate algorithms on optimal algorithmsfrom coarser level
Introduction PetaBricks OpenTuner Conclusions
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
⇒ Grid size 2i
• Partition accuracy space into discrete levels
• Base space of candidate algorithms on optimal algorithmsfrom coarser level
Introduction PetaBricks OpenTuner Conclusions
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
⇒ Grid size 2i
• Partition accuracy space into discrete levels
• Base space of candidate algorithms on optimal algorithmsfrom coarser level
Introduction PetaBricks OpenTuner Conclusions
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
⇒ Grid size 2i
• Partition accuracy space into discrete levels
• Base space of candidate algorithms on optimal algorithmsfrom coarser level
Introduction PetaBricks OpenTuner Conclusions
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
⇒Grid size 2i
• Partition accuracy space into discrete levels
• Base space of candidate algorithms on optimal algorithmsfrom coarser level
Introduction PetaBricks OpenTuner Conclusions
2D Poisson’s Equation (uses Multigrid)
1
2
4
8
16
32
64
100 1000 10000 100000 1e+06
Spe
edup
(x)
Input Size
Accuracy Level 109
Accuracy Level 107
Accuracy Level 105
Accuracy Level 103
Accuracy Level 101
2D Poisson’s equation
Introduction PetaBricks OpenTuner Conclusions
More Variable Accuracy Results
1
2
4
8
10 100 1000
Spe
edup
(x)
Input Size
Accuracy Level 0.95Accuracy Level 0.75Accuracy Level 0.50Accuracy Level 0.20Accuracy Level 0.10Accuracy Level 0.05
Clustering
1
10
100
1000
10000
10 100 1000 10000 100000 1e+06S
peed
up (
x)
Input Size
Accuracy Level 1.01Accuracy Level 1.1 Accuracy Level 1.2 Accuracy Level 1.3 Accuracy Level 1.4
Bin Packing
1
2
4
8
16
32
10 100 1000 10000
Spe
edup
(x)
Input Size
Accuracy Level 2.0Accuracy Level 1.5Accuracy Level 1.0Accuracy Level 0.8Accuracy Level 0.6Accuracy Level 0.3
Image Compression
1
2
4
8
16
32
10 100 1000 10000 100000
Spe
edup
(x)
Input Size
Accuracy Level 109
Accuracy Level 107
Accuracy Level 105
Accuracy Level 103
Accuracy Level 101
3D Helmholtz
1
2
4
8
16
32
64
100 1000 10000 100000 1e+06
Spe
edup
(x)
Input Size
Accuracy Level 109
Accuracy Level 107
Accuracy Level 105
Accuracy Level 103
Accuracy Level 101
2D Poisson
1
2
4
8
10 100 1000 10000
Spe
edup
(x)
Input Size
Accuracy Level 3.0Accuracy Level 2.0Accuracy Level 1.5Accuracy Level 1.0Accuracy Level 0.5Accuracy Level 0.0
Preconditioner
Introduction PetaBricks OpenTuner Conclusions
Results on Different Systems
Test SystemsCodename CPU(s) Cores GPU OpenCL Runtime
Desktop Core i7 920 @2.67GHz 4 NVIDIA Tesla C2070 CUDA Toolkit 3.2
Server 4× Xeon X7550 @2GHz 32 None AMD APP SDK 2.5
Laptop Core i5 2520M @2.5GHz 2 AMD Radeon HD 6630M Xcode 4.2
BenchmarksName
# PossibleConfigs
Generated OpenCLKernels
MeanAutotuning Time
TestingInput Size
SeparableConv. 101358 9 3.82 hours 35202
Black-Sholes 10130 1 3.09 hours 500000
Poisson2D SOR 101358 25 15.37 hours 20482
Sort 10920 7 3.56 hours 220
Strassen 101509 9 3.05 hours 10242
SVD 102435 8 1.79 hours 2562
Tridiagonal Solver 101040 8 5.56 hours 10242
Introduction PetaBricks OpenTuner Conclusions
Results on Different Systems
Test SystemsCodename CPU(s) Cores GPU OpenCL Runtime
Desktop Core i7 920 @2.67GHz 4 NVIDIA Tesla C2070 CUDA Toolkit 3.2
Server 4× Xeon X7550 @2GHz 32 None AMD APP SDK 2.5
Laptop Core i5 2520M @2.5GHz 2 AMD Radeon HD 6630M Xcode 4.2
BenchmarksName
# PossibleConfigs
Generated OpenCLKernels
MeanAutotuning Time
TestingInput Size
SeparableConv. 101358 9 3.82 hours 35202
Black-Sholes 10130 1 3.09 hours 500000
Poisson2D SOR 101358 25 15.37 hours 20482
Sort 10920 7 3.56 hours 220
Strassen 101509 9 3.05 hours 10242
SVD 102435 8 1.79 hours 2562
Tridiagonal Solver 101040 8 5.56 hours 10242
Introduction PetaBricks OpenTuner Conclusions
Separable Convolution (width=7)
1.0x
2.0x
3.0x
Desktop
Execution T
ime (
Norm
aliz
ed) Desktop Config
Server ConfigLaptop Config
Desktop Config Server Config Laptop Config
SeparableConv.1D kernel+local memory onGPU
1D kernel on OpenCL2D kernel+local memory onGPU
Introduction PetaBricks OpenTuner Conclusions
Separable Convolution (width=7)
1.0x
2.0x
3.0x
Desktop Server Laptop
Execution T
ime (
Norm
aliz
ed) Desktop Config
Server ConfigLaptop Config
Desktop Config Server Config Laptop Config
SeparableConv.1D kernel+local memory onGPU
1D kernel on OpenCL2D kernel+local memory onGPU
Introduction PetaBricks OpenTuner Conclusions
Separable Convolution (width=7)
1.0x
2.0x
3.0x
Desktop Server Laptop
Execution T
ime (
Norm
aliz
ed) Desktop Config
Server ConfigLaptop Config
Hand-coded OpenCL
Desktop Config Server Config Laptop Config
SeparableConv.1D kernel+local memory onGPU
1D kernel on OpenCL2D kernel+local memory onGPU
Introduction PetaBricks OpenTuner Conclusions
Poisson 2D SOR
1.0x
3.0x
5.0x
7.0x
9.0x
Desktop Server Laptop
Execution T
ime (
Norm
aliz
ed) Desktop Config
Server ConfigLaptop Config
Desktop Config Server Config Laptop Config
Poisson2D SORSplit on CPU followed bycompute on GPU
Split some parts on OpenCLfollowed by compute on CPU
Split on CPU followed bycompute on GPU
Introduction PetaBricks OpenTuner Conclusions
Singular Value Decomposition (SVD)
1.0x
1.2x
1.4x
1.6x
1.8x
2.0x
Desktop Server Laptop
Execution T
ime (
Norm
aliz
ed) Desktop Config
Server ConfigLaptop Config
Desktop Config Server Config Laptop Config
SVD
First phase: task parallism be-tween CPU/GPU; matrix multi-ply: 8-way parallel recursive de-composition on CPU, call LA-PACK when < 42× 42
First phase: all on CPU; ma-trix multiply: 8-way parallel re-cursive decomposition on CPU,call LAPACK when < 170×170
First phase: all on CPU; ma-trix multiply: 4-way parallel re-cursive decomposition on CPU,call LAPACK when < 85× 85
Introduction PetaBricks OpenTuner Conclusions
Results Takeaways
• Different configurations are required for best performance ondifferent systems
• Not just changing block sizes
• Can not be easily solved by a simple heuristic
• Motivates the need for algorithmic choice and autotuning
Introduction PetaBricks OpenTuner Conclusions
Autotuning Challenges
• Evaluating quality of candidate algorithms is expensive• Must run the program (at least once)• More expensive for unfit solutions• Scales poorly with larger problem sizes
• Fitness is noisy• Randomness from parallel races and system noise• Testing each candidate only once often produces a worse
algorithm• Running many trials is expensive
• Decision tree structures are complex• Not easy to hill-climb• We artificially bound them
Introduction PetaBricks OpenTuner Conclusions
Input Sensitivity
• Input sensitivity is a major challenge
• Different algorithms may be better for different inputs
• Use fast algorithm for easy inputs, slow algorithm for hardinputs
• Avoid pathological cases
Introduction PetaBricks OpenTuner Conclusions
Input Sensitivity Today
• Vast majority of programs today use a single algorithm for allinputs
• This forces design for the “worst case” input• Wastes time and resources
• Related work:• Uses hand written heuristics to adapt to inputs• Rectify inputs for security [Long el al.]
• Our system automatically classifies inputs and runs a programoptimized for the type of input being processed
Introduction PetaBricks OpenTuner Conclusions
Input Sensitivity Today
• Vast majority of programs today use a single algorithm for allinputs
• This forces design for the “worst case” input• Wastes time and resources
• Related work:• Uses hand written heuristics to adapt to inputs• Rectify inputs for security [Long el al.]
• Our system automatically classifies inputs and runs a programoptimized for the type of input being processed
Introduction PetaBricks OpenTuner Conclusions
Input Sensitivity Overview
TrainingDeployment
Input Classifier
Input Aware Learning
Program
Training Inputs
Feature ExtractorsInsights: - Feature Priority List - Performance Bounds
Input
Select Input Optimized Programs
Training
Selected Program
Run
Introduction PetaBricks OpenTuner Conclusions
Input Featuresf u n c t i o n Sor tto out [ n ]from i n [ n ]
i n p u t f e a t u r e So r t edne s s , Dup l i c a t i o n
{ . . . }f u n c t i o n So r t e dn e s sfrom i n [ n ]to s o r t e d n e s st u n a b l e doub l e l e v e l ( 0 . 0 , 1 . 0 ){
i n t s o r t e d coun t = 0 ;i n t count = 0 ;i n t s t e p = ( i n t ) ( l e v e l ∗n ) ;f o r ( i n t i =0; i+step<n ; i+=s t ep ) {
i f ( i n [ i ] <= in [ i+s t ep ] ) {// inc r ement f o r c o r r e c t l y o r d e r ed// p a i r s o f e l ement ss o r t e d coun t += 1 ;
}count += 1 ;
}i f ( count > 0)
s o r t e d n e s s = so r t e d coun t / ( doub l e ) count ;e l s e
s o r t e d n e s s = 0 . 0 ;}f u n c t i o n Dup l i c a t i o nfrom i n [ n ]to d u p l i c a t i o n{ . . . }
Introduction PetaBricks OpenTuner Conclusions
Training
FeaturesInput
Labels
Decision Tree
Max A Priori
Adaptive Tree
Classifier Constructors
1...m
0
m+1
Classifier Selector
Selection Objective
Considers cost of
extracting needed features
Input Classifier
Introduction PetaBricks OpenTuner Conclusions
How Many Landmarks Are Enough?
0 0.2 0.4 0.6 0.8 1
Lost
spe
edup
(L)
Size of region (pi)
2 configs3 configs4 configs5 configs6 configs7 configs8 configs9 configs
10 20 30 40 50 60 70 80 90 100S
peed
upLandmarks
Introduction PetaBricks OpenTuner Conclusions
Input Adaptation Results
sort
1 100
1
3
5
Landmarks
Spe
edup
clustering
1 100
2
3
Landmarks
Spe
edup
binpacking
1 100
1
1.05
Landmarks
Spe
edup
svd
1 100
0.9
1.1
Landmarks
Spe
edup
poisson2d
1 100
0.9
1.2
Landmarks
Spe
edup
helmholtz3d
1 100
0.8
1.0
Landmarks
Spe
edup
Introduction PetaBricks OpenTuner Conclusions
Related Projects
A small selection of many related projects:Package Domain Search Method
Active Harmony Runtime System Nelder-Mead
ATLAS Dense Linear Algebra Exhaustive
Code Perforation Compiler Exhaustive + Simulated Annealing
Dynamic Knobs Runtime System Control Theory
FFTW Fast Fourier Transform Exhaustive / Dynamic Prog.
Insieme Compiler Differential Evolution
Milepost GCC / cTuning Compiler IID Model + Central DB
OSKI Sparse Linear Algebra Exhaustive + Heuristic
PATUS Stencil Computations Nelder-Mead or Evolutionary
SEEC / Heartbeats Runtime System Control Theory
Sepya Stencil Computations Random-Restart Gradient Ascent
SPIRAL DSP Algorithms Pareto Active Learning
• Simple techniques (exhaustive, hill climbers, etc) are popular• No single technique is best for all problems
• Representations are often just integers/floats/booleans
Introduction PetaBricks OpenTuner Conclusions
Related Projects
A small selection of many related projects:Package Domain Search Method
Active Harmony Runtime System Nelder-Mead
ATLAS Dense Linear Algebra Exhaustive
Code Perforation Compiler Exhaustive + Simulated Annealing
Dynamic Knobs Runtime System Control Theory
FFTW Fast Fourier Transform Exhaustive / Dynamic Prog.
Insieme Compiler Differential Evolution
Milepost GCC / cTuning Compiler IID Model + Central DB
OSKI Sparse Linear Algebra Exhaustive + Heuristic
PATUS Stencil Computations Nelder-Mead or Evolutionary
SEEC / Heartbeats Runtime System Control Theory
Sepya Stencil Computations Random-Restart Gradient Ascent
SPIRAL DSP Algorithms Pareto Active Learning
• Simple techniques (exhaustive, hill climbers, etc) are popular• No single technique is best for all problems
• Representations are often just integers/floats/booleans
Introduction PetaBricks OpenTuner Conclusions
Limits of Existing Autotuning Projects
• We believe these factors limit the scope and efficiency ofautotuning
• A hill climber works great for a block size, but completely failsat synthesizing poly-algorithms
• Many users of autotuning work hard to prune their searchspaces to fit their techniques
• OpenTuner provides extensible representations and ensemblesof techniques which can solve more complex autotuningproblems
Introduction PetaBricks OpenTuner Conclusions
Limits of Existing Autotuning Projects
• We believe these factors limit the scope and efficiency ofautotuning
• A hill climber works great for a block size, but completely failsat synthesizing poly-algorithms
• Many users of autotuning work hard to prune their searchspaces to fit their techniques
• OpenTuner provides extensible representations and ensemblesof techniques which can solve more complex autotuningproblems
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Overview
OpenTuner: an extensible framework for program autotuning
Results Database
Search TechniquesSearch
Driver
Search
Reads: ResultsWrites: Desired Results
Measurement
User Defined Measurement
Function
Measurement Driver
Configuration Manipulator
Reads: Desired ResultsWrites: Results
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Configuration Manipulator Parameters
Parameter
Primitive Complex
Integer ScaledNumericFloat
LogInteger LogFloat PowerOfTwo
Switch Enum Permutation
Schedule
SelectorBoolean
• Hierarchical structure of parameters, user defined parametertypes can be added at any point
• Primitive parameters behave like bounded integers or floats
• Complex parameters have a set of stochastic mutationoperators
• Technique-specific operators
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Information sharing through ResultsDB
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Information sharing through ResultsDB
AUC Bandit
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Information sharing through ResultsDB
AUC Bandit
Which configuration should we try next?
?
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Information sharing through ResultsDB
AUC Bandit
Which configuration should we try next?
33%
Exploration
33% 33%
Introduction PetaBricks OpenTuner Conclusions
Ensembles of Techniques
Differential Evolution
Particle Swarm
Optimization
TorczonHill
Climber
Information sharing through ResultsDB
AUC Bandit
Which configuration should we try next?
100%
Exploitation
0% 0%
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Results
Project Benchmark Possible Configurations
GCC/G++ Flags all 10806
Halide Blur 1052
Halide Wavelet 1044
HPL n/a 109.9
PetaBricks Poisson 103657
PetaBricks Sort 1090
PetaBricks Strassen 10188
PetaBricks TriSolve 101559
Stencil all 106.5
Unitary n/a 1021
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Results
Project Benchmark Possible Configurations
GCC/G++ Flags all 10806
Halide Blur 1052
Halide Wavelet 1044
HPL n/a 109.9
PetaBricks Poisson 103657
PetaBricks Sort 1090
PetaBricks Strassen 10188
PetaBricks TriSolve 101559
Stencil all 106.5
Unitary n/a 1021
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Results: GCC Flags
fft.c
0.8
0.85
0.9
0.95
1
0 300 600 900 1200 1500 1800
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
gcc -O1gcc -O2gcc -O3
OpenTuner
matrixmultiply.cpp
0.1
0.15
0.2
0.25
0.3
0 300 600 900 1200 1500 1800
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
g++ -O1g++ -O2g++ -O3
OpenTuner
raytracer.cpp
0.4
0.45
0.5
0.55
0.6
0.65
0.7
0.75
0 300 600 900 1200 1500 1800
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
g++ -O1g++ -O2g++ -O3
OpenTuner
tsp ga.cpp
0.4
0.45
0.5
0.55
0.6
0 300 600 900 1200 1500 1800
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
g++ -O1g++ -O2g++ -O3
OpenTuner
Introduction PetaBricks OpenTuner Conclusions
OpenTuner Results: PetaBricks
Poisson 2D
0 0.02 0.04 0.06 0.08
0.1 0.12 0.14 0.16 0.18
0.2
0 600 1200 1800 2400 3000 3600
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
PetaBricks AutotunerOpenTuner
Sort
0
0.02
0.04
0.06
0.08
0.1
0 600 1200
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
PetaBricks AutotunerOpenTuner
Strassen
0
0.05
0.1
0.15
0.2
0 600 1200
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
PetaBricks AutotunerOpenTuner
Tridiagonal Solver
0.0095
0.01
0.0105
0.011
0.0115
0 600 1200
Exe
cutio
n Ti
me
(sec
onds
)
Autotuning Time (seconds)
PetaBricks AutotunerOpenTuner
Introduction PetaBricks OpenTuner Conclusions
Conclusions
• PetaBricks has pushed the limits of what can be done withalgorithmic choice
• Provides performance portability by allowing programs toadapt to their environment
• Have shown: variable accuracy, multigrid, and input sensitivity• Hope that future main stream programming languages will
incorperate algorithmic choice and autotuning
• OpenTuner can expand the scope of program autotuning forother projects
• Extensible configuration representation• Ensembles of techniques• Hope that field of autotuning will expand to much more
complex problems
Introduction PetaBricks OpenTuner Conclusions
Coauthors and Collaborators
• Saman Amarasinghe
• Cy Chan
• Yufei Ding
• Alan Edelman
• Sam Fingeret
• Sanath Jayasena
• Shoaib Kamil
• Kevin Kelley
• Erika Lee
• Deepak Narayanan
• Marek Olszewski
• Una-May O’Reilly
• Maciej Pacula
• Phitchaya Mangpo Phothilimthana
• Jonathan Ragan-Kelley
• Xipeng Shen
• Michele Tartara
• Kalyan Veeramachaneni
• Yod Watanaprakornku
• Yee Lok Wong
• Kevin Wu
• Minshu Zhan
• Qin Zhao
Introduction PetaBricks OpenTuner Conclusions
Thanks!
About me:http://jasonansel.com/
http://opentuner.org/
http://projects.csail.mit.edu/petabricks/
Recommended