Upload
valerie-stanley
View
218
Download
0
Embed Size (px)
Citation preview
Car
neg
ie M
ello
n
SPIRAL: An OverviewSPIRAL: An Overview
• José Moura (CMU)• Jeremy Johnson (Drexel)• Robert Johnson (MathStar)• David Padua (UIUC)• Viktor Prasanna (USC)• Markus Püschel (CMU)• Manuela Veloso (CMU)
• Gavin Haentjens (CMU)• Pinit Kumhom (Drexel)• Neungsoo Park (USC)• David Sepiashvili (CMU)• Bryan Singer (CMU)• Yevgen Voronenko (Drexel)• Edward Wertz (CMU)• Jianxin Xiong (UIUC)
Faculty
Students
http://www.ece.cmu.edu/~spiral
José M.F. Moura and Markus Püschel
Collaborators• Christoph Überhuber (TU Vienna)• Franz Franchetti (TU Vienna)
Car
neg
ie M
ello
n
SponsorSponsor
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting.
Car
neg
ie M
ello
nMoore’s Moore’s Law and High(est)Law and High(est) Performance Performance Scientific ComputingScientific Computing
arithmetic cost model (counting adds and mults) is not accurate for predicting runtime best code is machine dependent (registers/caches size, structure) hand-tuned code becomes obsolete as fast as it is written compiler limitations
Moore’s Law: processor-memory bottleneck short life cycles of computers very complex architectures
• vendor specific• special instructions (MMX, SSE, FMA, …)• undocumented features
(single processor, off-the-shelf)
Consequences for software/algorithms:
Portable performance requires automation
Car
neg
ie M
ello
nSPIRALSPIRAL
Automates
cuts development costs code less error-prone
takes advantage of architecture specific features porting without loss of performance
systematic exploration of alternatives both at algorithmic and code level
are performance critical
Implementation
Platform-Adaptation
Optimization
of DSP algorithms
A library generator for highly optimized signal processing algorithms
Car
neg
ie M
ello
nSPIRAL SPIRAL systemsystem
DSP transformspecifies
user
goes for a coffee
Formula Generator
SPL Compiler Sea
rch
En
gin
e
runtime on given platform
controlsimplementation options
controlsalgorithm generation
fast algorithmas SPL formula
C/Fortran/SIMDcodeS P
I R
A L
(or an
espres
so fo
r small tran
sfo
rms)
platform-adaptedimplementation
comes back
Car
neg
ie M
ello
nRelatedRelated Work on Code Generation/Adaptation Work on Code Generation/Adaptation
PhiPAC, ATLAS (Linear algebra) Enumeration and evaluation of different blocking, looping, etc.
strategies for BLAS routines
SPARSITY (sparse matrix-vector multiply) Search for optimal blocking strategy to improve register
performance
FFTW (discrete Fourier transform package) Generated code modules (machine independent) for small
sizes Flexible recursion to adapt to memory hierarchy
SPIRAL
Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms
Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space
Car
neg
ie M
ello
n
DSP Transform
Algorithm
Car
neg
ie M
ello
nDSPDSP Algorithms: Example 4-point DFT Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
algorithms reduce arithmetic cost O(n^2)O(nlog(n)) product of structured sparse matrices mathematical notation exhibits structure
1000
0010
0100
0001
1100
1100
0011
0011
000
0100
0010
0001
1010
0101
1010
0101
11
1111
11
1111
iii
ii
4222
42224 )()( LDFTITIDFTDFT
Fourier transform
Identity Permutation
Diagonal matrix (twiddles)
Kronecker product
Car
neg
ie M
ello
n
DSPDSP Algorithms: Terminology (SPIRAL) Algorithms: Terminology (SPIRAL)
Transform
Rule
Formula
PDFTIDIDFTDFT mnmnnm
PFIIDIFDFT 222428
parameterized matrix
• a breakdown strategy • product of sparse matrices
• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation
nDFT
Ruletree 8DFT
2DFT 4DFT
2DFT 2DFT
• few constructs and primitives• uniquely defines an algorithm• can be translated into code
Car
neg
ie M
ello
n
DSPDSP Transforms Transforms
nkliDFTn /2exp
222 DFTDFTWHT k
nlkDCT nII /2/1cos)(
nlkDCT nIV /)2/1(2/1cos)(
nklDST nI /sin)(
nlnkMDCT nn /)2/1)(2/)1((cos2
Others: filters, discrete wavelet transforms, Haar, Hartley, …
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sineTransforms (16 types)
modified discrete cosine transform
TT two-dimensional transform
Car
neg
ie M
ello
n
RulesRules = Breakdown Strategies = Breakdown Strategies
Qn
IVn
IIn
IIn FIDCTDCTPDCT 22/
)(2/
)(2/
)(
DDCTSDCT IIn
IVn )()(
2)(
2 2/1,1 FdiagDCT II
rIV
n MMDCT 1)(
1 1 12 2 2 21
( )n n n n n ni i i t
n
i
WHT I WHT I
EWIPIWDWTWDWT knnnn )())(()( 2/2/2/
PDFTIDIDFTDFT mnmnnm
CDSTDCTBDFT In
Inn )( )(
2/)(2/
base case
recursive
translation
iterative
recursive
recursive
recursive
iterative/recursive
built from few constructs and primitives
Car
neg
ie M
ello
nAlgorithmsAlgorithms = Ruletrees = Formulas = Ruletrees = Formulas
)(8
IIDCT
)(4
IIDCT)(
4IVDCT
R1)()( 2/2
)(2/
)(2/
)(n
IVn
IIn
IIn IFDCTDCTPDCT
2FR3 R6
2FR4
R3
R1
R6
2F
2FR4
2)(
22
1FDCT II
)(2
IIDCT
)(2
IIDST
)(2
IVDCT
)(2
IIDST
)(2
IIDCT
R1R6 SDCTPDCT II
nIV
n )()(
)(4
IIDCT)(2
IVDCT
Car
neg
ie M
ello
n
FormulaFormula for a DCT, size 16 for a DCT, size 16
Car
neg
ie M
ello
n
NumberNumber of Formulas/Algorithms of Formulas/Algorithms
k
123456789
# DFTs, size 2^k
16
40296
27744162570361280~1.01 • 10^27~2.31 • 10^61
~2.86 • 10^133
# DCT IV, size 2^k
110
12631242
19244433627343815121631354242
~1.07 • 10^38~2.30 • 10^76
~1.06 • 10^153
exponential search space
Using the rules included in SPIRAL:
Car
neg
ie M
ello
n
Algorithm (Formula)
Implementation
DSP Transform
Car
neg
ie M
ello
nFormulasFormulas in SPL in SPL
( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) )
• • • •
• • • •
Car
neg
ie M
ello
nSPLSPL Syntax (Subset) Syntax (Subset)
matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off
allows extension of SPL
controls loop unrolling
Car
neg
ie M
ello
n
SPLSPL Compiler, 4-point FFT Compiler, 4-point FFT
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
f0 = x(1) + x(3)f1 = x(1) - x(3)f2 = x(2) + x(4)f3 = x(2) - x(4)f4 = (0.00d0,-1.00d0)*f(3)y(1) = f0 + f2y(2) = f0 - f2y(3) = f1 + f4y(4) = f1 - f4
r0 = x(1) + x(5)r1 = x(1) - x(5)r2 = x(2) + x(6)r3 = x(2) - x(6)r4 = x(3) + x(7)r5 = x(3) - x(7)r6 = x(4) + x(8)r7 = x(4) - x(8)y(1) = r0 + r4y(2) = r1 + r5y(3) = r0 - r4y(4) = r1 - r5y(5) = r2 + r7y(6) = r3 - r6y(7) = r2 - r7y(8) = r3 + r6
fast algorithmas
formulaas
SPL program#codetype
complex real
Car
neg
ie M
ello
nSPLSPL Compiler: Summary Compiler: Summary
Parsing
Intermediate Code Generation
Intermediate Code Restructuring
Target Code Generation
Symbol Table Abstract Syntax Tree
I-Code
I-Code
C, FORTRAN function
Template Table
SPL Formula Template DefinitionSymbol Definition
OptimizationI-Code
SPL Program
Built-in optimizations:
single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE
Extensible through templates
Car
neg
ie M
ello
n
SIMDSIMD Short Vector Extensions Short Vector Extensions
+ x
vector length = 4(4-way)
Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Requires fine grain parallelism Large potential speed-up
SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic (compiler) vectorization very limited
Problems:
very difficult to use
Car
neg
ie M
ello
nVector Vector code generation from SPL formulascode generation from SPL formulas
Naturally vectorizable construct
A
x y
4IAvector length
iiii
k
ii QEIADP )(
1
Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulasν SIMD vector length
(Current) generic construct completely vectorizable:
Vectorization in two steps:
1. Formula manipulation using manipulation rules2. Code generation (vector code + C code)
Car
neg
ie M
ello
n
Algorithm (Formula)
Implementation
DSP Transform
Search
Car
neg
ie M
ello
nWhyWhy Search? Search?
DCT, type IV, size 16
maaaany different formulas large spread in runtimes, even for modest size precisely equal arithmetic cost best formula is platform-dependent
~31000 formulas
Toy problem:scheduled
Car
neg
ie M
ello
nSearchSearch Methods available in SPIRAL Methods available in SPIRAL
Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm)
Possible FormulasSizes Timed Results
Exhaust Very small All Best
DP All 10s-100s (very) good
Random All User decided fair/good
Hill Climbing All 100s-1000s Good
STEER All 100s-1000s (very) good
Search over • algorithm space and • implementation options (degree of unrolling)
Car
neg
ie M
ello
nSTEERSTEER
Population n:
Population n+1:
……
……
Mutation
Cross-Breeding
expanddifferently
swapexpansions
Survival of Fittest
Car
neg
ie M
ello
n
LearningLearning to Generate Fast Algorithms to Generate Fast Algorithms
• Learns from given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy)• Learns from a transform of one size, generates the best algorithm for many sizes• Tested for DFT and WHT
Car
neg
ie M
ello
n
Experimental ResultsExperimental Results
Car
neg
ie M
ello
nGeneratedGenerated DFT Code: Pentium 4, SSE DFT Code: Pentium 4, SSE
(Pse
ud
o)
gfl
op
/s
DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (vector to C code) up to factor of 3.1
0
1
2
3
4
5
6
7
4 5 6 7 8 9 10 11 12 13
Spiral SSE
Intel MKL interl.
FFTW 2.1.3
Spiral C
Spiral C vect
SIMD-FFT
hand-tuned vendor assembly code
*P. Rodriguez. A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. Proc. ICASSP 2002
*
Car
neg
ie M
ello
nGeneratedGenerated DFT Code: Pentium 4, SSE2 DFT Code: Pentium 4, SSE2
gfl
op
s
DFT 2n double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (vector to C code) up to factor of 1.8
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4 5 6 7 8 9 10 11 12
Spiral SSE2
Spiral SSE2 exh.
Intel MKL split
FFTW 2.1.3
Spiral C
Spiral C vect
Car
neg
ie M
ello
n
Other Other transformstransforms
0
0.5
1
1.5
2
2.5
3
3.5
4 5 6 7 8 9 10 11 12 13 14
Spiral float
Spiral C vect
Spiral C SSE
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
4x4 8x8 16x16 32x32 64x64 128x128
Spiral SSE
Spiral C vect
Spiral C float
gfl
op
s
transform size transform size
2-dim DCT 2n x 2n
Pentium 4, 2.53 GHz, SSEWHT 2n
Pentium 4, 2.53 GHz, SSE
speedups (vector to C code) up to factor of 3
WHT has only additions very simple transform
Car
neg
ie M
ello
nBest Best DFT Trees, size DFT Trees, size 210 = 1024 = 1024
scalar
C vect
SIMD
Pentium 4float
Pentium 4double
Pentium IIIfloat
AthlonXPfloat
10
6
4
4
10
8
7
5
10
8
6
4
2
2
2
2 2
2 2
2 2
2
2
1
2
2 3
10
8
5
2
2
2 3
10
9
7
5
1
2
22 3
10
6
2
4
2 2
2 2
4
10
4
2
6
2 4
2 2
2
10
5
2
5
2 3 3
10
6
3
4
2 2 3
10
8
5
2
2
2 3
10
6
4
4
2 2
2 2
2
10
5
2
5
2 3 3
trees platform/datatype dependent
Car
neg
ie M
ello
nCrosstimingCrosstiming of best trees on Pentium 4 of best trees on Pentium 4
Rel
ativ
e p
erfo
rman
ce w
.r.t
. b
est
DFT 2n single precision, runtime of best found of other platforms
n
software adaptation is necessary
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
4 5 6 7 8 9 10 11 12 13
Pentium 4 SSE
Pentium 4 SSE2
AthlonXP SSE
PentiumIII SSE
Pentium 4 float
e.g., ~50% performance loss byusing PIII code on P4
Car
neg
ie M
ello
nConclusionsConclusions
Mathematical computer representation of algorithms
Automatic translation of algorithms into code
SPIRAL closes the gap between math domain (algorithms) andimplementation domain (programs)
High level: Mathematical manipulation of algorithms
Low level: Coding degrees of freedom
SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives
http://www.ece.cmu.edu/~spiral
Car
neg
ie M
ello
nReferencesReferences
Related Work• R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing 1998. Math-atlas.sourceforge.net• M. Frigo and S.-G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp. 1381-1384. www.fftw.org• E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp. 127-136.
Further Reading on SPIRAL
• M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications.• J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp. 298-308.• Bryan Singer and Manuela Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp. 2003-2014.• F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS 2002.• F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS 2003.
http://www.ece.cmu.edu/~spiral