24
A Fast Fourier Transform Compiler Matteo Frigo Supercomputing Technologies Group MIT Laboratory for Computer Science 1

A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

A Fast Fourier Transform Compiler

Matteo Frigo

Supercomputing Technologies GroupMIT Laboratory for Computer Science

1

Page 2: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

genfft

genfft is a domain-specific FFT compiler.

� Generates fast C code for Fourier transforms of arbitrary size.

� Written in Objective Caml 2.0 [Leroy 1998], a dialect of ML.

� “Discovered” previously unknown FFT algorithms.

� From a complex-number FFT algorithm it derives a real-numberalgorithm [Sorensen et al., 1987] automatically.

2

Page 3: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

genfft used in FFTWgenfft produced the “inner loop” of FFTW (95% of the code).

� FFTW [Frigo and Johnson, 1997–1999] is a library for comput-ing one- and multidimensional real and complex discrete Fouriertransforms of arbitrary size.

� FFTW adapts itself to the hardware automatically, giving porta-bility and speed at the same time.

� Parallel versions are available for Cilk, Posix threads and MPI.

� Code, documentation and benchmark results at

http���theory�lcs�mit�edu��fftw

genfft makes FFTW possible.

3

Page 4: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

FFTW’s performanceFFTW is faster than all other portable FFT codes, and it is compara-ble with machine-specific libraries provided by vendors.

B

B

B

B

B

B

BB

BB B

B

B

B

BB

B B B B B B

J

J

J

JJ

J

J

J

JJ

JJ

J

J

J J J J J JJ J

H

H

H

HH

HH

H HH

HH

HH

HH

H H H HH H

F

F

F

FF

F

F

F

FF

FF

F

FF F F F F F

F F

MM

M

MM

M M

MM

MM

M MM

MM M M M M M M

7 7 7 7 77 7 7 7 7 7 7 7

7 7 7 7 7 7

2 4 8

16 32 64

128

256

512

1024

2048

4096

8192

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

2097

152

4194

304

0

50

100

150

200

250m

flops

Transform Size

B FFTW

J SUNPERF

H Ooura

F FFTPACK

Green

Sorensen

Singleton

Krukar

M Temperton

Bailey

NR

7 QFT

Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.

4

Page 5: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

FFTW’s performance

B

B

B

B

BB

BB

B B

B

BB B

B

B

B B B B

EE

E

E

E

E

EE

E

E E

EE

E

E

E

EE E E

0

50

100

150

200

250

transform size

B FFTW

E SUNPERF

FFTW versus Sun’s Performance library on a 167-MHz Sun Ultra-SPARC I, single precision.

5

Page 6: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Why FFTW is fastFFTW does not implement a single fixed algorithm. Instead, thetransform is computed by an executor, composed of highly opti-mized, composable blocks of C code called codelets. Roughly speak-ing, a codelet computes a Fourier transform of small size.

At runtime, a planner finds a fast composition of codelets. The plan-ner measures the speed of different plans and chooses the fastest.

FFTW contains 120 codelets for a total of 56215 lines of op-timized code.

6

Page 7: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Real FFTs are complexThe data-flow graph of the Fourier transform is not a butterfly, it is amess!

Complex butterfly

The FFT is even messier when the size isnot a power of �, or when the input con-sists of real numbers.

Real and imaginary parts

7

Page 8: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

genfft’s compilation strategy1. Create a dag for an FFT algorithm in the simplest possible way.

Do not bother optimizing.

2. Simplify the dag.

3. Schedule the dag, using a register-oblivious schedule.

4. Unparse to C. (Could also unparse to FORTRAN or other lan-guages.)

8

Page 9: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

The case for genfft

Why not rely on the C compiler to optimize codelets?

� FFT algorithms are best expressed recursively, but C compilersare not good at unrolling recursion.

� The FFT needs very long code sequences to use all registers ef-fectively. (FFTW on Alpha EV56 computes FFT(64) with 2400lines of straight C code.)

� Register allocators choke if you feed them with long code se-quences. In the case of the FFT, however, we know the (asymp-totically) optimal register allocation.

� Certain optimizations only make sense for FFTs.

Since C compilers are good at instruction selection and instructionscheduling, I did not need to implement these steps in genfft.

9

Page 10: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Outline

� Dag creation.

� The simplifier.

� Register-oblivious scheduling.

� Related work.

� Conclusion.

10

Page 11: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Dag representationA dag is represented as a data type node. (Think of a syntax tree ora symbolic algebra system.)

type node �

Num of Number�number � Load of Variable�variable

� Plus of node list � Times of node � node � ���

��

v�v�

v�

v�v�

v� � Plus v� Times �Num � v���

v� � Plus Times �Num � v�� v� v��

(All numbers are real.)

An abstraction layer implements complex arithmetic on pairs of dagnodes.

11

Page 12: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Dag creationThe function fftgen performs a complete symbolic evaluation of anFFT algorithm, yielding a dag.

No single FFT algorithm is good for all input sizes n. genfft con-tains many algorithms, and fftgen dispatches the most appropriate.

� Cooley-Tukey [1965] (works for n � pq).

� Prime factor [Good, 1958] (works for n � pq when gcd�p� q� �

�).

� Split-radix [Duhamel and Hollman, 1984] (works for n � �k).

� Rader [1968] (works when n is prime).

12

Page 13: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Cooley-Tukey FFT algorithmDSP book:

yk �n��X

j��xj�jk

n �p��X

j����

��

� q��Xj���

xpj��j��j�k�

q

�A�j�k�

n

���j�k�

p �

where n � pq and k � k� � qk�.

OCaml code:

let cooley�tukey n p q x �

let inner j� � fftgen q

�fun j� �� x �p � j� � j��� in

let twiddle k� j� �

�omega n �j� � k��� �� �inner j� k�� in

let outer k� � fftgen p �twiddle k�� in

�fun k �� outer �k mod q� �k � q��13

Page 14: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Outline

� Dag creation.

� The simplifier.

� Register-oblivious scheduling.

� Related work.

� Conclusion.

14

Page 15: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

The simplifierWell-known optimizations:

� Algebraic simplification, e.g., a� � a.

� Constant folding.

� Common-subexpression elimination. (In the current implemen-tation, CSE is encapsulated into a monad.)

Specific to the FFT:

� (Elimination of negative constants.)

� Network transposition.

Why did I implement these well-known optimizations, whichare implemented in C compilers anyway?Because the the specific optimizations can only be done afterthe well-known optimizations.

15

Page 16: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Network transpositionA network is a dag that computes a linear function [Crochiere andOppenheim, 1975]. (Recall that the FFT is linear.)

Reversing all edges yields a transposed network.x

y

st

st

� �

xy

xy

st

xy

st

If a network computes X � AY , the transposed network computes

Y � ATX .16

Page 17: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Optimize the transposed dag!

genfft employs this dag optimization algorithm:

OPTIMIZE(G) �

G� �� SIMPLIFY(G)

GT� �� SIMPLIFY(GT� )

RETURN SIMPLIFY(G�)

(Iterating the transposition has no effect.)

17

Page 18: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Benefits of network transposition

genfft literaturesize adds muls muls savings adds muls

original transposedcomplex to complex5 32 16 12 25% 34 1010 84 32 24 25% 88 2013 176 88 68 23% 214 7615 156 68 56 18% 162 50real to complex5 12 8 6 25% 17 510 34 16 12 25% ? ?13 76 44 34 23% ? ?15 64 31 25 19% 67 25

18

Page 19: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Outline

� Dag creation.

� The simplifier.

� Register-oblivious scheduling.

� Related work.

� Conclusion.

19

Page 20: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

How to help register allocatorsgenfft’s scheduler reorders the dag so that the register allocator of

the C compiler can minimize the number of register spills.

genfft’s recursive partitioning heuristic is asymptotically optimalfor n � �k.

genfft’s schedule is register-oblivious: This schedule does not de-pend on the number R of registers, and yet it is optimal for all R.

Certain C compilers insist on imposing their own schedule.With egcs ���/SPARC, FFTW is 50-100% faster if I use

�fno�schedule�insns.

(How do I persuade the SGI compiler to respect my sched-ule?)

20

Page 21: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Recursive partitioningThe scheduler partitions the dag by cutting it in the “middle.” Thisoperation creates connected components that are scheduled recur-sively.

21

Page 22: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Register-oblivious schedulingAssume that we want to compute the FFT of n points, and a computerhas R registers, where n � R.

Lower bound [Hong and Kung, 1981]. If n � �k, the n-point FFTrequires at least ��n lg n� lgR� register spills.

Upper bound. If n � �k, genfft’s output program for n-point FFTincurs at most O�n lg n� lgR� register spills.

A more general theory of cache-oblivious algorithms exists [Frigo,Leiserson, Prokop, and Ramachandran, 1999].

22

Page 23: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

Related work� FOURGEN [Maruhn, 1976]. Written in PL/I, generates FOR-

TRAN transforms of size �k.

� [Johnson and Burrus, 1983] automate the design of DFT pro-grams using dynamic programming.

� [Selesnick and Burrus, 1986] generate MATLAB programs forFFTs of certain sizes.

� [Perez and Takaoka, 1987] generated prime-factor FFT programs.

� [Veldhuizen, 1995] uses a template metaprograms technique togenerate C++.

� EXTENT [Gupta et al., 1996] compiles a tensor-product lan-guage into FORTRAN.

23

Page 24: A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

ConclusionA domain-specific compiler is a valuable tool.

� For the FFT on modern processors, performance requires straight-line sequences with hundreds of instructions.

� Correctness was surprisingly easy.

� Rapid turnaround. (About 15 minutes to regenerate FFTW.)

� We can implement problem-specific code improvements.

� genfft derives new algorithms.

24