A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing

A Fast Fourier Transform Compiler

Matteo Frigo

Supercomputing Technologies GroupMIT Laboratory for Computer Science

1

genfft

genfft is a domain-specific FFT compiler.

� Generates fast C code for Fourier transforms of arbitrary size.

� Written in Objective Caml 2.0 [Leroy 1998], a dialect of ML.

� “Discovered” previously unknown FFT algorithms.

� From a complex-number FFT algorithm it derives a real-numberalgorithm [Sorensen et al., 1987] automatically.

2

genfft used in FFTWgenfft produced the “inner loop” of FFTW (95% of the code).

� FFTW [Frigo and Johnson, 1997–1999] is a library for comput-ing one- and multidimensional real and complex discrete Fouriertransforms of arbitrary size.

� FFTW adapts itself to the hardware automatically, giving porta-bility and speed at the same time.

� Parallel versions are available for Cilk, Posix threads and MPI.

� Code, documentation and benchmark results at

http��theory�lcs�mit�edu��fftw

genfft makes FFTW possible.

3

FFTW’s performanceFFTW is faster than all other portable FFT codes, and it is compara-ble with machine-specific libraries provided by vendors.

B

B

B

B

B

B

BB

BB B

B

B

B

BB

B B B B B B

J

J

J

JJ

J

J

J

JJ

JJ

J

J

J J J J J JJ J

H

H

H

HH

HH

H HH

HH

HH

HH

H H H HH H

F

F

F

FF

F

F

F

FF

FF

F

FF F F F F F

F F

MM

M

MM

M M

MM

MM

M MM

MM M M M M M M

7 7 7 7 77 7 7 7 7 7 7 7

7 7 7 7 7 7

2 4 8

16 32 64

128

256

512

1024

2048

4096

8192

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

2097

152

4194

304

0

50

100

150

200

250m

flops

Transform Size

B FFTW

J SUNPERF

H Ooura

F FFTPACK

Green

Sorensen

Singleton

Krukar

M Temperton

Bailey

NR

7 QFT

Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.

4

FFTW’s performance

B

B

B

B

BB

BB

B B

B

BB B

B

B

B B B B

EE

E

E

E

E

EE

E

E E

EE

E

E

E

EE E E

0

50

100

150

200

250

transform size

B FFTW

E SUNPERF

FFTW versus Sun’s Performance library on a 167-MHz Sun Ultra-SPARC I, single precision.

5

Why FFTW is fastFFTW does not implement a single fixed algorithm. Instead, thetransform is computed by an executor, composed of highly opti-mized, composable blocks of C code called codelets. Roughly speak-ing, a codelet computes a Fourier transform of small size.

At runtime, a planner finds a fast composition of codelets. The plan-ner measures the speed of different plans and chooses the fastest.

FFTW contains 120 codelets for a total of 56215 lines of op-timized code.

6

Real FFTs are complexThe data-flow graph of the Fourier transform is not a butterfly, it is amess!

Complex butterfly

The FFT is even messier when the size isnot a power of �, or when the input con-sists of real numbers.

Real and imaginary parts

7

genfft’s compilation strategy1. Create a dag for an FFT algorithm in the simplest possible way.

Do not bother optimizing.

2. Simplify the dag.

3. Schedule the dag, using a register-oblivious schedule.

4. Unparse to C. (Could also unparse to FORTRAN or other lan-guages.)

8

The case for genfft

Why not rely on the C compiler to optimize codelets?

� FFT algorithms are best expressed recursively, but C compilersare not good at unrolling recursion.

� The FFT needs very long code sequences to use all registers ef-fectively. (FFTW on Alpha EV56 computes FFT(64) with 2400lines of straight C code.)

� Register allocators choke if you feed them with long code se-quences. In the case of the FFT, however, we know the (asymp-totically) optimal register allocation.

� Certain optimizations only make sense for FFTs.

Since C compilers are good at instruction selection and instructionscheduling, I did not need to implement these steps in genfft.

9

Outline

� Dag creation.

� The simplifier.

� Register-oblivious scheduling.

� Related work.

� Conclusion.

10

Dag representationA dag is represented as a data type node. (Think of a syntax tree ora symbolic algebra system.)

type node �

Num of Number�number � Load of Variable�variable

� Plus of node list � Times of node � node � ��

��

v�v�

v�

v�v�

v� � Plus v� Times �Num � v��

v� � Plus Times �Num � v�� v� v��

(All numbers are real.)

An abstraction layer implements complex arithmetic on pairs of dagnodes.

11

Dag creationThe function fftgen performs a complete symbolic evaluation of anFFT algorithm, yielding a dag.

No single FFT algorithm is good for all input sizes n. genfft con-tains many algorithms, and fftgen dispatches the most appropriate.

� Cooley-Tukey [1965] (works for n � pq).

� Prime factor [Good, 1958] (works for n � pq when gcd�p� q� �

�).

� Split-radix [Duhamel and Hollman, 1984] (works for n � �k).

� Rader [1968] (works when n is prime).

12

Cooley-Tukey FFT algorithmDSP book:

yk �n��X

j��xj�jk

n �p��X

j��

��

� q��Xj��

xpj��j��j�k�

q

�A�j�k�

n

��j�k�

p �

where n � pq and k � k� � qk�.

OCaml code:

let cooley�tukey n p q x �

let inner j� � fftgen q

�fun j� �� x �p � j� � j�� in

let twiddle k� j� �

�omega n �j� � k�� inner j� k�� in

let outer k� � fftgen p �twiddle k�� in

�fun k �� outer �k mod q� �k � q��13

Outline

� Dag creation.

� The simplifier.


� Related work.

� Conclusion.

14

The simplifierWell-known optimizations:

� Algebraic simplification, e.g., a� � a.

� Constant folding.

� Common-subexpression elimination. (In the current implemen-tation, CSE is encapsulated into a monad.)

Specific to the FFT:

� (Elimination of negative constants.)

� Network transposition.

Why did I implement these well-known optimizations, whichare implemented in C compilers anyway?Because the the specific optimizations can only be done afterthe well-known optimizations.

15

Network transpositionA network is a dag that computes a linear function [Crochiere andOppenheim, 1975]. (Recall that the FFT is linear.)

Reversing all edges yields a transposed network.x

y

st

st

�

� �

xy

�

�

xy

st

xy

�

�

�

st

�

�

If a network computes X � AY , the transposed network computes

Y � ATX .16

Optimize the transposed dag!

genfft employs this dag optimization algorithm:

OPTIMIZE(G) �

G� �� SIMPLIFY(G)

GT� �� SIMPLIFY(GT� )

RETURN SIMPLIFY(G�)

(Iterating the transposition has no effect.)

17

Benefits of network transposition

genfft literaturesize adds muls muls savings adds muls

original transposedcomplex to complex5 32 16 12 25% 34 1010 84 32 24 25% 88 2013 176 88 68 23% 214 7615 156 68 56 18% 162 50real to complex5 12 8 6 25% 17 510 34 16 12 25% ? ?13 76 44 34 23% ? ?15 64 31 25 19% 67 25

18

Outline

� Dag creation.

� The simplifier.


� Related work.

� Conclusion.

19

How to help register allocatorsgenfft’s scheduler reorders the dag so that the register allocator of

the C compiler can minimize the number of register spills.

genfft’s recursive partitioning heuristic is asymptotically optimalfor n � �k.

genfft’s schedule is register-oblivious: This schedule does not de-pend on the number R of registers, and yet it is optimal for all R.

Certain C compilers insist on imposing their own schedule.With egcs ��/SPARC, FFTW is 50-100% faster if I use

�fno�schedule�insns.

(How do I persuade the SGI compiler to respect my sched-ule?)

20

Recursive partitioningThe scheduler partitions the dag by cutting it in the “middle.” Thisoperation creates connected components that are scheduled recur-sively.

21

Register-oblivious schedulingAssume that we want to compute the FFT of n points, and a computerhas R registers, where n � R.

Lower bound [Hong and Kung, 1981]. If n � �k, the n-point FFTrequires at least ��n lg n� lgR� register spills.

Upper bound. If n � �k, genfft’s output program for n-point FFTincurs at most O�n lg n� lgR� register spills.

A more general theory of cache-oblivious algorithms exists [Frigo,Leiserson, Prokop, and Ramachandran, 1999].

22

Related work� FOURGEN [Maruhn, 1976]. Written in PL/I, generates FOR-

TRAN transforms of size �k.

� [Johnson and Burrus, 1983] automate the design of DFT pro-grams using dynamic programming.

� [Selesnick and Burrus, 1986] generate MATLAB programs forFFTs of certain sizes.

� [Perez and Takaoka, 1987] generated prime-factor FFT programs.

� [Veldhuizen, 1995] uses a template metaprograms technique togenerate C++.

� EXTENT [Gupta et al., 1996] compiles a tensor-product lan-guage into FORTRAN.

23

ConclusionA domain-specific compiler is a valuable tool.

� For the FFT on modern processors, performance requires straight-line sequences with hundreds of instructions.

� Correctness was surprisingly easy.

� Rapid turnaround. (About 15 minutes to regenerate FFTW.)

� We can implement problem-specific code improvements.

� genfft derives new algorithms.

24

Documents

A Fast Fourier Transform Compilerjjohnson/2009-10/winter/... · ’s compilation strategy 1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing