Carnegie Mellon SPIRAL: An Overview José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus

SPIRAL: An Overview

José M.F. Moura and Markus Püschel

http://www.ece.cmu.edu/~spiral

Faculty
• José Moura (CMU)
• Jeremy Johnson (Drexel)
• Robert Johnson (MathStar)
• David Padua (UIUC)
• Viktor Prasanna (USC)
• Markus Püschel (CMU)
• Manuela Veloso (CMU)

Students
• Gavin Haentjens (CMU)
• Pinit Kumhom (Drexel)
• Neungsoo Park (USC)
• David Sepiashvili (CMU)
• Bryan Singer (CMU)
• Yevgen Voronenko (Drexel)
• Edward Wertz (CMU)
• Jianxin Xiong (UIUC)

Collaborators
• Christoph Überhuber (TU Vienna)
• Franz Franchetti (TU Vienna)

Sponsor

Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

Moore's Law and High(est) Performance Scientific Computing

(single processor, off-the-shelf)

Moore's Law:
• processor-memory bottleneck
• short life cycles of computers
• very complex architectures
  • vendor specific
  • special instructions (MMX, SSE, FMA, …)
  • undocumented features

Consequences for software/algorithms:
• arithmetic cost model (counting adds and mults) is not accurate for predicting runtime
• best code is machine dependent (register/cache sizes, structure)
• hand-tuned code becomes obsolete as fast as it is written
• compiler limitations

Portable performance requires automation

SPIRAL

A library generator for highly optimized signal processing algorithms

Automates implementation, platform adaptation, and optimization of DSP algorithms (which are performance critical):

Implementation
• cuts development costs
• code less error-prone

Platform-Adaptation
• takes advantage of architecture specific features
• porting without loss of performance

Optimization
• systematic exploration of alternatives both at algorithmic and code level

SPIRAL system

[Diagram: the SPIRAL system]
• The user specifies a DSP transform, goes for a coffee (or an espresso for small transforms), and comes back to a platform-adapted implementation.
• Formula Generator: produces a fast algorithm as SPL formula.
• SPL Compiler: translates the formula into C/Fortran/SIMD code.
• Search Engine: receives the runtime on the given platform; controls algorithm generation (Formula Generator) and controls implementation options (SPL Compiler).

Related Work on Code Generation/Adaptation

PhiPAC, ATLAS (linear algebra)
• Enumeration and evaluation of different blocking, looping, etc. strategies for BLAS routines

SPARSITY (sparse matrix-vector multiply)
• Search for optimal blocking strategy to improve register performance

FFTW (discrete Fourier transform package)
• Generated code modules (machine independent) for small sizes
• Flexible recursion to adapt to memory hierarchy

SPIRAL
• Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms
• Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space


DSP Transform

Algorithm

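DSP Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2) \cdot T^4_2 \cdot (I_2 \otimes \mathrm{DFT}_2) \cdot L^4_2
$$

• $\mathrm{DFT}_2$: Fourier transform
• $I_2$: Identity
• $L^4_2$: Permutation
• $T^4_2$: Diagonal matrix (twiddles)
• $\otimes$: Kronecker product

• algorithms reduce arithmetic cost $O(n^2) \to O(n \log n)$
• product of structured sparse matrices
• mathematical notation exhibits structure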
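The factorization on this slide can be checked numerically. The following is a minimal sketch in plain Python, assuming the sign convention $\mathrm{DFT}_n = [e^{-2\pi i kl/n}]$ and the index conventions for $L$ and $T$ stated in the comments; the helper names (`kron`, `stride_perm`, `twiddle`, `cooley_tukey`) are illustrative and not part of SPIRAL.

```python
import cmath

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product a ⊗ b."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def dft(n):
    """DFT_n = [w^(k*l)], w = exp(-2*pi*i/n) (assumed sign convention)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def stride_perm(n, s):
    """L^n_s: y[j*(n/s) + i] = x[i*s + j], i.e. the input is read at stride s."""
    m = n // s
    p = [[0] * n for _ in range(n)]
    for i in range(m):
        for j in range(s):
            p[j * m + i][i * s + j] = 1
    return p

def twiddle(n, s):
    """T^n_s: diagonal with entry w_n^(i*j) at position i*s + j."""
    w = cmath.exp(-2j * cmath.pi / n)
    d = [w ** (i * j) for i in range(n // s) for j in range(s)]
    return [[d[a] if a == b else 0 for b in range(n)] for a in range(n)]

def cooley_tukey(r, s):
    """(DFT_r ⊗ I_s) · T^{rs}_s · (I_r ⊗ DFT_s) · L^{rs}_r as a dense matrix."""
    n = r * s
    factors = [kron(dft(r), identity(s)), twiddle(n, s),
               kron(identity(r), dft(s)), stride_perm(n, r)]
    out = factors[0]
    for f in factors[1:]:
        out = matmul(out, f)
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# The product of the four sparse factors reproduces the dense DFT matrix.
print(max_abs_diff(cooley_tukey(2, 2), dft(4)))
```

The same function also covers other splits such as `cooley_tukey(2, 4)` for size 8; a real implementation would of course never form the dense matrices.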

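DSP Algorithms: Terminology (SPIRAL)

Transform: $\mathrm{DFT}_n$
• parameterized matrix

Rule: $\mathrm{DFT}_{nm} = (\mathrm{DFT}_n \otimes I_m) \cdot D \cdot (I_n \otimes \mathrm{DFT}_m) \cdot P$
• a breakdown strategy
• product of sparse matrices

Ruletree: $\mathrm{DFT}_8$ with children $\mathrm{DFT}_2$ and $\mathrm{DFT}_4$; $\mathrm{DFT}_4$ with children $\mathrm{DFT}_2$ and $\mathrm{DFT}_2$
• recursive application of rules
• uniquely defines an algorithm
• efficient representation
• easy manipulation

Formula: $\mathrm{DFT}_8 = (F_2 \otimes I_4) \cdot D \cdot (I_2 \otimes (\cdots (I_2 \otimes F_2) \cdots)) \cdot P$
• few constructs and primitives
• uniquely defines an algorithm
• can be translated into code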
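To illustrate how a ruletree uniquely defines an algorithm, here is a small sketch that represents a ruletree as nested Python tuples and computes a DFT by following it. The encoding (a leaf is an integer size, a node is a pair of subtrees expanded by the Cooley/Tukey rule) and the sign convention $e^{-2\pi i/n}$ are assumptions made for this example, not SPIRAL's internal representation.

```python
import cmath

def size(tree):
    """Transform size of a ruletree: a leaf is an int, a node is (left, right)."""
    return tree if isinstance(tree, int) else size(tree[0]) * size(tree[1])

def naive_dft(x):
    """DFT by definition, used for the leaves and as a reference."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[l] * w ** (k * l) for l in range(n)) for k in range(n)]

def apply_ruletree(tree, x):
    """Compute DFT(x) by recursively applying the Cooley/Tukey rule as the tree says."""
    if isinstance(tree, int):            # leaf: computed by definition
        return naive_dft(x)
    left, right = tree
    r, s = size(left), size(right)
    n = r * s
    w = cmath.exp(-2j * cmath.pi / n)
    # L^n_r followed by I_r ⊗ DFT_s: r sub-transforms of size s on stride-r slices
    blocks = [apply_ruletree(right, x[j::r]) for j in range(r)]
    # diagonal matrix of twiddle factors
    blocks = [[blocks[j][k2] * w ** (j * k2) for k2 in range(s)] for j in range(r)]
    # DFT_r ⊗ I_s: s sub-transforms of size r across the blocks
    y = [0] * n
    for k2 in range(s):
        column = apply_ruletree(left, [blocks[j][k2] for j in range(r)])
        for k1 in range(r):
            y[k1 * s + k2] = column[k1]
    return y

# The ruletree from the slide: DFT_8 -> (DFT_2, DFT_4 -> (DFT_2, DFT_2))
tree = (2, (2, 2))
x = [complex(v) for v in range(8)]
y = apply_ruletree(tree, x)
```

A different tree for the same size, for example `((2, 2), 2)`, computes the same transform with a different data flow, which is what makes the choice of tree a search problem.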

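DSP Transforms

discrete Fourier transform: $\mathrm{DFT}_n = [\exp(-2\pi i\, kl/n)]$

Walsh-Hadamard transform: $\mathrm{WHT}_{2^k} = \mathrm{DFT}_2 \otimes \cdots \otimes \mathrm{DFT}_2$ ($k$ factors)

discrete cosine and sine transforms (16 types):
$\mathrm{DCT}^{(II)}_n = [\cos(k(l + 1/2)\pi/n)]$
$\mathrm{DCT}^{(IV)}_n = [\cos((k + 1/2)(l + 1/2)\pi/n)]$
$\mathrm{DST}^{(I)}_n = [\sin(kl\pi/n)]$

modified discrete cosine transform: $\mathrm{MDCT}_{n \times 2n} = [\cos((k + 1/2)(l + (n + 1)/2)\pi/n)]$

two-dimensional transform: $T \otimes T$

Others: filters, discrete wavelet transforms, Haar, Hartley, …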
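As a small illustration of a transform defined through the Kronecker product, the sketch below builds $\mathrm{WHT}_{2^k}$ as a $k$-fold tensor power of $\mathrm{DFT}_2$ and compares it with the usual butterfly-based fast algorithm. The function names are chosen for this example only.

```python
def kron(a, b):
    """Kronecker (tensor) product a ⊗ b of matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def wht_matrix(k):
    """WHT_{2^k} = DFT_2 ⊗ ... ⊗ DFT_2 (k factors), built by definition."""
    f2 = [[1, 1], [1, -1]]
    m = [[1]]
    for _ in range(k):
        m = kron(m, f2)
    return m

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def fast_wht(x):
    """k stages of add/subtract butterflies; only additions and subtractions."""
    y = list(x)
    n, h = len(y), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                y[i], y[i + h] = y[i] + y[i + h], y[i] - y[i + h]
        h *= 2
    return y

x = [1, 0, 1, 0, 0, 1, 1, 0]
by_definition = matvec(wht_matrix(3), x)   # 64 multiply-adds
by_butterflies = fast_wht(x)               # 3 * 8 additions/subtractions
```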

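Rules = Breakdown Strategies

$\mathrm{DCT}^{(II)}_2 = \mathrm{diag}(1, 1/\sqrt{2}) \cdot F_2$   (base case)

$\mathrm{DCT}^{(II)}_n = P \cdot (\mathrm{DCT}^{(II)}_{n/2} \oplus \mathrm{DCT}^{(IV)}_{n/2}) \cdot (I_{n/2} \otimes F_2)^{Q}$   (recursive)

$\mathrm{DCT}^{(IV)}_n = S \cdot \mathrm{DCT}^{(II)}_n \cdot D$   (translation)

$\mathrm{DCT}^{(IV)}_n = M_1 \cdots M_r$   (iterative)

$\mathrm{DFT}_n = B \cdot (\mathrm{DCT}^{(I)}_{n/2} \oplus \mathrm{DST}^{(I)}_{n/2}) \cdot C$   (recursive)

$\mathrm{DFT}_{nm} = (\mathrm{DFT}_n \otimes I_m) \cdot D \cdot (I_n \otimes \mathrm{DFT}_m) \cdot P$   (recursive)

$\mathrm{DWT}_n(W) = (\mathrm{DWT}_{n/2}(W) \oplus I_{n/2}) \cdot P \cdot (I_{n/2} \otimes_k W) \cdot E$   (recursive)

$\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right)$, $n = n_1 + \cdots + n_t$   (iterative/recursive)

built from few constructs and primitives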

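Algorithms = Ruletrees = Formulas

Rules used:
R1: $\mathrm{DCT}^{(II)}_n = P \cdot (\mathrm{DCT}^{(II)}_{n/2} \oplus \mathrm{DCT}^{(IV)}_{n/2}) \cdot (F_2 \otimes I_{n/2})$
R3: $\mathrm{DCT}^{(II)}_2 = \mathrm{diag}(1, 1/\sqrt{2}) \cdot F_2$
R6: $\mathrm{DCT}^{(IV)}_n = P \cdot \mathrm{DCT}^{(II)}_n \cdot S$

[Figure: ruletree for $\mathrm{DCT}^{(II)}_8$, expanded with rules R1, R3, R4, and R6 through $\mathrm{DCT}^{(II)}_4$ and $\mathrm{DCT}^{(IV)}_4$ down to size-2 transforms ($\mathrm{DCT}^{(II)}_2$, $\mathrm{DCT}^{(IV)}_2$, $\mathrm{DST}^{(II)}_2$, $F_2$), shown next to the equivalent formula]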

Formula for a DCT, size 16

[Figure: fully expanded formula for a DCT of size 16]

Number of Formulas/Algorithms

Using the rules included in SPIRAL:

k    # DFTs, size 2^k     # DCT IV, size 2^k
1    1                    1
2    6                    10
3    40                   126
4    296                  31242
5    27744                1924443362
6    162570361280         7343815121631354242
7    ~1.01 • 10^27        ~1.07 • 10^38
8    ~2.31 • 10^61        ~2.30 • 10^76
9    ~2.86 • 10^133       ~1.06 • 10^153

exponential search space
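To get a feel for how such counts arise, the sketch below counts ruletrees for a DFT of size 2^k under a deliberately simplified assumption: only the Cooley/Tukey rule is available and only DFT_2 is a leaf. SPIRAL's numbers are much larger because it includes further rules; the function name is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_ct_ruletrees(k):
    """Number of ruletrees for DFT of size 2^k using only Cooley/Tukey splits.

    Assumption for this sketch: size 2 is the only leaf, and every larger
    size 2^k is split into sizes 2^i and 2^(k-i) for some 1 <= i < k.
    """
    if k == 1:
        return 1
    return sum(count_ct_ruletrees(i) * count_ct_ruletrees(k - i)
               for i in range(1, k))

counts = [count_ct_ruletrees(k) for k in range(1, 10)]
print(counts)  # [1, 1, 2, 5, 14, 42, 132, 429, 1430]
```

Even this restricted count grows exponentially (these are the Catalan numbers), so adding more rules per transform quickly rules out exhaustive enumeration.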


Algorithm (Formula)

Implementation

DSP Transform

Formulas in SPL

( compose
  ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )
  ( permutation ( 1 3 4 2 ) )
  ( tensor ( I 2 ) ( F 2 ) )
  ( permutation ( 1 4 2 3 ) )
  ( direct_sum
    ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) )
    ( compose
      ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) )
      ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )
      ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) )
      ( permutation ( 2 1 ) )
• • • •

SPL Syntax (Subset)

matrix operations:
  (compose formula formula ...)
  (tensor formula formula ...)
  (direct_sum formula formula ...)

direct matrix description:
  (matrix (a11 a12 ...) (a21 a22 ...) ...)
  (diagonal (d1 d2 ...))
  (permutation (p1 p2 ...))

parameterized matrices:
  (I n)
  (F n)

scalars:
  1.5, 2/7, cos(..), w(3), pi, 1.2e-04

definition of new symbols (allows extension of SPL):
  (define name formula)
  (template formula (i-code-list))

directives for code generation:
  #codetype real/complex
  #unroll on/off          (controls loop unrolling)

SPL Compiler, 4-point FFT

fast algorithm as formula, as SPL program:

  (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))

#codetype complex:

  f0 = x(1) + x(3)
  f1 = x(1) - x(3)
  f2 = x(2) + x(4)
  f3 = x(2) - x(4)
  f4 = (0.00d0,-1.00d0)*f3
  y(1) = f0 + f2
  y(2) = f1 + f4
  y(3) = f0 - f2
  y(4) = f1 - f4

#codetype real (interleaved real and imaginary parts):

  r0 = x(1) + x(5)
  r1 = x(1) - x(5)
  r2 = x(2) + x(6)
  r3 = x(2) - x(6)
  r4 = x(3) + x(7)
  r5 = x(3) - x(7)
  r6 = x(4) + x(8)
  r7 = x(4) - x(8)
  y(1) = r0 + r4
  y(2) = r2 + r6
  y(3) = r1 + r7
  y(4) = r3 - r5
  y(5) = r0 - r4
  y(6) = r2 - r6
  y(7) = r1 - r7
  y(8) = r3 + r5
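Straight-line code of this kind is easy to sanity-check against the transform definition. The sketch below is a hand-written Python transliteration (not compiler output) of unrolled code for the 4-point formula above, in a complex and a real variant. It assumes the convention $\mathrm{DFT}_4 = [e^{-2\pi i kl/4}]$, output in natural order, and interleaved real/imaginary storage for the real variant.

```python
import cmath

def fft4_complex(x):
    """Unrolled (F2 ⊗ I2) · T^4_2 · (I2 ⊗ F2) · L^4_2 on 4 complex inputs."""
    f0 = x[0] + x[2]
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    f4 = -1j * f3                      # the only nontrivial twiddle factor
    return [f0 + f2, f1 + f4, f0 - f2, f1 - f4]

def fft4_real(x):
    """Same algorithm on 8 reals: x = [re0, im0, re1, im1, re2, im2, re3, im3]."""
    r0 = x[0] + x[4]
    r1 = x[0] - x[4]
    r2 = x[1] + x[5]
    r3 = x[1] - x[5]
    r4 = x[2] + x[6]
    r5 = x[2] - x[6]
    r6 = x[3] + x[7]
    r7 = x[3] - x[7]
    # multiplying by -i swaps real and imaginary parts, so no multiplication is needed
    return [r0 + r4, r2 + r6, r1 + r7, r3 - r5,
            r0 - r4, r2 - r6, r1 - r7, r3 + r5]

def naive_dft(x):
    """Reference: DFT by definition."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[l] * w ** (k * l) for l in range(n)) for k in range(n)]

x = [1 + 2j, -3 + 0.5j, 4 - 1j, 0.25 + 7j]
y_complex = fft4_complex(x)
interleaved = [part for v in x for part in (v.real, v.imag)]
y_real = fft4_real(interleaved)
```

Note that the real variant uses 16 additions and no multiplications, which is the kind of saving a `#codetype real` translation can expose.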

SPL Compiler: Summary

SPL Program (SPL Formula, Symbol Definition, Template Definition)
  → Parsing → Abstract Syntax Tree, Symbol Table, Template Table
  → Intermediate Code Generation → I-Code
  → Intermediate Code Restructuring → I-Code
  → Optimization → I-Code
  → Target Code Generation → C, FORTRAN function

Built-in optimizations:
• single static assignment code
• no reuse of temporary vars
• only scalar temporary vars
• constants precomputed
• limited CSE

Extensible through templates

SIMD Short Vector Extensions

[Figure: element-wise vector addition (+) and multiplication (x); vector length = 4 (4-way)]

• Extension to instruction set architecture
• Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)
• Requires fine grain parallelism
• Large potential speed-up

Problems:
• SIMD instructions are architecture specific
• No common API (usually assembly hand coding)
• Performance very sensitive to memory access
• Automatic (compiler) vectorization very limited

very difficult to use

Vector code generation from SPL formulas

Naturally vectorizable construct:

$y = (A \otimes I_4)\, x$   (vector length 4)

(Current) generic construct completely vectorizable:

$\prod_{i=1}^{k} P_i D_i (A_i \otimes I_\nu) E_i Q_i$

• $P_i$, $Q_i$: permutations
• $D_i$, $E_i$: diagonals
• $A_i$: arbitrary formulas
• $\nu$: SIMD vector length

Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)
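Why $A \otimes I_\nu$ vectorizes naturally can be seen in a few lines: every scalar operation in the code for $A$ becomes the same operation on a contiguous block of $\nu$ elements. The sketch below emulates this with Python lists standing in for SIMD registers; the function names are hypothetical and no actual SIMD instructions are involved.

```python
def kron(a, b):
    """Kronecker (tensor) product, used here only to build the reference matrix."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def apply_tensor_vectorized(a, x, nu):
    """y = (A ⊗ I_nu) x, computed on length-nu blocks instead of scalars."""
    # Split x into contiguous blocks of nu elements (one "vector register" each).
    vecs = [x[j * nu:(j + 1) * nu] for j in range(len(a[0]))]
    y = []
    for row in a:
        acc = [0] * nu
        for coeff, v in zip(row, vecs):
            # one element-wise multiply-add on a whole block of nu elements
            acc = [s + coeff * e for s, e in zip(acc, v)]
        y.extend(acc)
    return y

a = [[1, 1], [1, -1]]          # A = F_2
x = list(range(8))
y_vec = apply_tensor_vectorized(a, x, 4)
y_ref = matvec(kron(a, identity(4)), x)
```

All memory accesses are to aligned, contiguous blocks, which is why this construct needs no shuffles; the permutations $P_i$, $Q_i$ in the generic construct are where the data reorganization cost sits.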


Algorithm (Formula)

Implementation

DSP Transform

Search

Why Search?

Toy problem: DCT, type IV, size 16 (~31000 formulas)

• very many different formulas
• large spread in runtimes, even for modest size
• precisely equal arithmetic cost
• best formula is platform-dependent

Search Methods available in SPIRAL

• Exhaustive Search
• Dynamic Programming (DP)
• Random Search
• Hill Climbing
• STEER (similar to a genetic algorithm)

Method          Possible Sizes    Formulas Timed    Results
Exhaustive      Very small        All               Best
DP              All               10s-100s          (very) good
Random          All               User decided      fair/good
Hill Climbing   All               100s-1000s        Good
STEER           All               100s-1000s        (very) good

Search over
• algorithm space and
• implementation options (degree of unrolling)
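The dynamic programming idea can be sketched as follows: the best ruletree for size 2^k is assembled from the best ruletrees already found for smaller sizes, so only a few candidates per size have to be evaluated. Everything specific here is an assumption for illustration: ruletrees are encoded by exponents (a leaf `k` is an unrolled DFT of size 2^k, a node is a Cooley/Tukey split), and `toy_cost` is a made-up stand-in for a measured runtime.

```python
def exponent(tree):
    """log2 of the transform size: a leaf is an int k, a node is (left, right)."""
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])

def toy_cost(tree):
    """Hypothetical cost model standing in for a timing run (not SPIRAL's)."""
    if isinstance(tree, int):
        n = 2 ** tree
        return n * n                       # leaf computed by definition
    left, right = tree
    r, s = 2 ** exponent(left), 2 ** exponent(right)
    # s transforms of size r, r transforms of size s, plus r*s twiddle multiplies
    return s * toy_cost(left) + r * toy_cost(right) + r * s

def dp_search(max_k, cost=toy_cost, max_leaf=3):
    """best[k] = cheapest ruletree found for size 2^k, built from best subtrees."""
    best = {}
    for k in range(1, max_k + 1):
        candidates = [k] if k <= max_leaf else []
        candidates += [(best[i], best[k - i]) for i in range(1, k)]
        best[k] = min(candidates, key=cost)   # in SPIRAL: time each candidate
    return best

best = dp_search(10)
print(best[10], toy_cost(best[10]))
```

DP evaluates only on the order of k candidates per size, which matches the "10s-100s" formulas timed in the table; the price is the assumption that a best tree is composed of best subtrees, which need not hold on real machines.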

STEER

Population n → Population n+1 by:
• Mutation: expand differently
• Cross-Breeding: swap expansions
• Survival of Fittest
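The three operators can be sketched on ruletrees as below. This is an illustrative evolutionary loop in the spirit of the slide, not the STEER implementation: the tree encoding (leaf = exponent `k` of an unrolled size-2^k DFT, node = Cooley/Tukey split), the probabilities, the population size, and the cost model `toy_cost` (a stand-in for measured runtime) are all assumptions.

```python
import random

def exponent(tree):
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])

def toy_cost(tree):
    """Hypothetical cost model standing in for a timing run."""
    if isinstance(tree, int):
        return (2 ** tree) ** 2
    r, s = 2 ** exponent(tree[0]), 2 ** exponent(tree[1])
    return s * toy_cost(tree[0]) + r * toy_cost(tree[1]) + r * s

def random_tree(k, rng):
    """Expand size 2^k by random splits; small sizes may stay unrolled leaves."""
    if k == 1 or (k <= 3 and rng.random() < 0.5):
        return k
    i = rng.randint(1, k - 1)
    return (random_tree(i, rng), random_tree(k - i, rng))

def mutate(tree, rng):
    """Mutation: pick a node and expand it differently."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(exponent(tree), rng)
    left, right = tree
    if rng.random() < 0.5:
        return (mutate(left, rng), right)
    return (left, mutate(right, rng))

def subtrees(tree):
    yield tree
    if not isinstance(tree, int):
        yield from subtrees(tree[0])
        yield from subtrees(tree[1])

def crossbreed(a, b, rng):
    """Cross-breeding: swap in an expansion of the same size taken from b."""
    donors = list(subtrees(b))
    def graft(node):
        same = [d for d in donors if exponent(d) == exponent(node)]
        if same and rng.random() < 0.5:
            return rng.choice(same)
        if isinstance(node, int):
            return node
        return (graft(node[0]), graft(node[1]))
    return graft(a)

def steer_like_search(k, generations=20, pop_size=16, seed=0):
    rng = random.Random(seed)
    population = [random_tree(k, rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        children = [mutate(t, rng) for t in population]
        children += [crossbreed(rng.choice(population), rng.choice(population), rng)
                     for _ in range(pop_size)]
        # survival of the fittest: keep the pop_size cheapest trees
        population = sorted(population + children, key=toy_cost)[:pop_size]
        history.append(toy_cost(population[0]))
    return population[0], history

best_tree, history = steer_like_search(8)
```

Because parents compete with their children for survival, the best cost in `history` never gets worse from one generation to the next.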

Learning to Generate Fast Algorithms

• Learns from given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy)
• Learns from a transform of one size, generates the best algorithm for many sizes
• Tested for DFT and WHT

Experimental Results

Generated DFT Code: Pentium 4, SSE

[Figure: DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0. x-axis: n (4 to 13); y-axis: (pseudo) gflop/s (0 to 7). Series: Spiral SSE, Intel MKL interl., FFTW 2.1.3, Spiral C, Spiral C vect, SIMD-FFT*. Annotation: hand-tuned vendor assembly code.]

speedups (vector to C code) up to factor of 3.1

* P. Rodriguez. A Radix-2 FFT Algorithm for Modern Single Instruction Multiple Data (SIMD) Architectures. Proc. ICASSP 2002

Generated DFT Code: Pentium 4, SSE2

[Figure: DFT 2^n, double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0. x-axis: n (4 to 12); y-axis: gflops (0 to 3.5). Series: Spiral SSE2, Spiral SSE2 exh., Intel MKL split, FFTW 2.1.3, Spiral C, Spiral C vect.]

speedups (vector to C code) up to factor of 1.8

Other transforms

[Figure, panel 1: WHT 2^n, Pentium 4, 2.53 GHz, SSE. x-axis: transform size n (4 to 14); y-axis: gflops (0 to 3.5). Series: Spiral float, Spiral C vect, Spiral C SSE.]

[Figure, panel 2: 2-dim DCT 2^n x 2^n, Pentium 4, 2.53 GHz, SSE. x-axis: transform size (4x4 to 128x128); y-axis: gflops (0 to 4.5). Series: Spiral SSE, Spiral C vect, Spiral C float.]

speedups (vector to C code) up to factor of 3

WHT has only additions; very simple transform

Best DFT Trees, size 2^10 = 1024

[Figure: best ruletrees found for the DFT of size 2^10, shown for scalar, C vect, and SIMD code on Pentium 4 (float), Pentium 4 (double), Pentium III (float), and AthlonXP (float). Each tree has root 10 and node labels between 1 and 10.]

trees platform/datatype dependent

Crosstiming of best trees on Pentium 4

[Figure: DFT 2^n, single precision: performance on Pentium 4 of the best trees found on other platforms. x-axis: n (4 to 13); y-axis: relative performance w.r.t. best (0.00 to 1.00). Series: Pentium 4 SSE, Pentium 4 SSE2, AthlonXP SSE, PentiumIII SSE, Pentium 4 float.]

e.g., ~50% performance loss by using PIII code on P4

software adaptation is necessary

Conclusions

SPIRAL closes the gap between math domain (algorithms) and implementation domain (programs)
• Mathematical computer representation of algorithms
• Automatic translation of algorithms into code

SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives
• High level: Mathematical manipulation of algorithms
• Low level: Coding degrees of freedom


References

Related Work
• R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing 1998. math-atlas.sourceforge.net
• M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp. 1381-1384. www.fftw.org
• E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp. 127-136.

Further Reading on SPIRAL
• M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications.
• J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp. 298-308.
• Bryan Singer and Manuela Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp. 2003-2014.
• F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS 2002.
• F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS 2003.

