A Fast Fourier Transform Compiler
Matteo Frigo
Supercomputing Technologies Group
MIT Laboratory for Computer Science
genfft
genfft is a domain-specific FFT compiler.
• Generates fast C code for Fourier transforms of arbitrary size.
• Written in Objective Caml 2.0 [Leroy 1998], a dialect of ML.
• “Discovered” previously unknown FFT algorithms.
• From a complex-number FFT algorithm it derives a real-number algorithm [Sorensen et al., 1987] automatically.
genfft used in FFTW
genfft produced the “inner loop” of FFTW (95% of the code).
• FFTW [Frigo and Johnson, 1997–1999] is a library for computing one- and multidimensional real and complex discrete Fourier transforms of arbitrary size.
• FFTW adapts itself to the hardware automatically, giving portability and speed at the same time.
• Parallel versions are available for Cilk, POSIX threads, and MPI.
• Code, documentation, and benchmark results are at http://theory.lcs.mit.edu/~fftw
genfft makes FFTW possible.
FFTW’s performance
FFTW is faster than all other portable FFT codes, and it is comparable with machine-specific libraries provided by vendors.
[Figure: speed in mflops (0 to 250) versus transform size (2 to 4194304) for FFTW, SUNPERF, Ooura, FFTPACK, Green, Sorensen, Singleton, Krukar, Temperton, Bailey, NR, and QFT.]
Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.
FFTW’s performance
[Figure: speed (vertical axis, 0 to 250) versus transform size for FFTW and SUNPERF.]
FFTW versus Sun’s Performance library on a 167-MHz Sun UltraSPARC I, single precision.
Why FFTW is fast
FFTW does not implement a single fixed algorithm. Instead, the transform is computed by an executor, composed of highly optimized, composable blocks of C code called codelets. Roughly speaking, a codelet computes a Fourier transform of small size.
At runtime, a planner finds a fast composition of codelets. The planner measures the speed of different plans and chooses the fastest.
FFTW contains 120 codelets for a total of 56215 lines of optimized code.
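The "measure and choose the fastest" step can be sketched as follows. This is a minimal illustration and not FFTW's planner: FFTW itself is written in C, and the names `fastest` and `time_of` are hypothetical. A plan is treated as an opaque value, and the caller supplies the `measure` function.

```ocaml
(* Hypothetical sketch: return the candidate plan with the smallest
   measured cost. [measure] is supplied by the caller. *)
let fastest measure = function
  | [] -> invalid_arg "fastest: no candidate plans"
  | first :: rest ->
    let best, _ =
      List.fold_left
        (fun (best, best_cost) plan ->
           let cost = measure plan in
           if cost < best_cost then (plan, cost) else (best, best_cost))
        (first, measure first)
        rest
    in
    best

(* One possible cost function: CPU seconds taken by running a thunk once. *)
let time_of run =
  let start = Sys.time () in
  run ();
  Sys.time () -. start
```

With plans represented as thunks, `fastest time_of plans` would pick the one that ran fastest on this machine. A real planner would repeat the measurements to reduce timing noise.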
Real FFTs are complex
The data-flow graph of the Fourier transform is not a butterfly, it is a mess!
[Figure: complex butterfly]
The FFT is even messier when the size is not a power of 2, or when the input consists of real numbers.
[Figure: real and imaginary parts]
genfft’s compilation strategy
1. Create a dag for an FFT algorithm in the simplest possible way. Do not bother optimizing.
2. Simplify the dag.
3. Schedule the dag, using a register-oblivious schedule.
4. Unparse to C. (Could also unparse to FORTRAN or other languages.)
The case for genfft
Why not rely on the C compiler to optimize codelets?
• FFT algorithms are best expressed recursively, but C compilers are not good at unrolling recursion.
• The FFT needs very long code sequences to use all registers effectively. (FFTW on Alpha EV56 computes FFT(64) with 2400 lines of straight C code.)
• Register allocators choke if you feed them with long code sequences. In the case of the FFT, however, we know the (asymptotically) optimal register allocation.
• Certain optimizations only make sense for FFTs.
Since C compilers are good at instruction selection and instruction scheduling, I did not need to implement these steps in genfft.
Outline
• Dag creation.
• The simplifier.
• Register-oblivious scheduling.
• Related work.
• Conclusion.
Dag representation
A dag is represented as a data type node. (Think of a syntax tree or a symbolic algebra system.)
    type node =
      | Num of Number.number
      | Load of Variable.variable
      | Plus of node list
      | Times of node * node
      | ...
[Figure: a small example dag and the Plus, Times, and Num constructor expressions for two of its nodes.]
(All numbers are real.)
An abstraction layer implements complex arithmetic on pairs of dag nodes.
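A self-contained sketch of this representation, under stated assumptions: plain floats and strings stand in for genfft's `Number.number` and `Variable.variable`, and the helpers `eval`, `cadd`, and `cmul` are hypothetical names. It shows how complex arithmetic can be layered on pairs of real-valued nodes.

```ocaml
(* Simplified, hypothetical node type: floats and variable names replace
   genfft's Number.number and Variable.variable. *)
type node =
  | Num of float
  | Load of string
  | Plus of node list
  | Times of node * node

(* Evaluate a node, looking variables up in an association list. *)
let rec eval env = function
  | Num c -> c
  | Load v -> List.assoc v env
  | Plus ns -> List.fold_left (fun acc n -> acc +. eval env n) 0.0 ns
  | Times (a, b) -> eval env a *. eval env b

(* A complex value is a (real part, imaginary part) pair of real nodes. *)
type cnode = node * node

let cadd ((ar, ai) : cnode) ((br, bi) : cnode) : cnode =
  (Plus [ar; br], Plus [ai; bi])

(* (ar + i ai)(br + i bi) = (ar br - ai bi) + i (ar bi + ai br) *)
let cmul ((ar, ai) : cnode) ((br, bi) : cnode) : cnode =
  (Plus [Times (ar, br); Times (Num (-1.0), Times (ai, bi))],
   Plus [Times (ar, bi); Times (ai, br)])

let () =
  let env = [("xr", 1.0); ("xi", 2.0); ("yr", 3.0); ("yi", 4.0)] in
  let (pr, pi) = cmul (Load "xr", Load "xi") (Load "yr", Load "yi") in
  (* (1 + 2i)(3 + 4i) = -5 + 10i *)
  Printf.printf "%g %g\n" (eval env pr) (eval env pi)
```

Because every node is real, later passes only ever see real additions and multiplications, even when the algorithm is written in terms of complex numbers.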
Dag creation
The function fftgen performs a complete symbolic evaluation of an FFT algorithm, yielding a dag.
No single FFT algorithm is good for all input sizes n. genfft contains many algorithms, and fftgen dispatches to the most appropriate one.
• Cooley-Tukey [1965] (works for n = pq).
• Prime factor [Good, 1958] (works for n = pq when gcd(p, q) = 1).
• Split-radix [Duhamel and Hollmann, 1984] (works for n = 2^k).
• Rader [1968] (works when n is prime).
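The applicability conditions above can be turned into a dispatch function. The sketch below is an assumption about how such a choice might be made; the slides do not state genfft's actual policy, and the type `algorithm` and the function `choose` are hypothetical.

```ocaml
(* Hypothetical dispatch: which algorithm applies to a given size n. *)
type algorithm =
  | Split_radix                 (* n = 2^k *)
  | Rader                       (* n prime *)
  | Prime_factor of int * int   (* n = p * q with gcd (p, q) = 1 *)
  | Cooley_tukey of int * int   (* n = p * q, no coprimality needed *)

let is_power_of_two n = n > 0 && n land (n - 1) = 0

(* Smallest factor p >= 2 of n; returns n itself when n is prime. *)
let smallest_factor n =
  let rec go p =
    if p * p > n then n else if n mod p = 0 then p else go (p + 1)
  in
  go 2

let choose n =
  if n < 1 then invalid_arg "choose: n must be positive"
  else if is_power_of_two n then Split_radix
  else
    let p = smallest_factor n in
    if p = n then Rader
    else
      (* Pull out the full power of p, so the two factors are coprime. *)
      let rec full pk = if (n / pk) mod p = 0 then full (pk * p) else pk in
      let pk = full p in
      let q = n / pk in
      if q > 1 then Prime_factor (pk, q) else Cooley_tukey (p, n / p)
```

For example, 15 = 3 * 5 has coprime factors, while 9 = 3 * 3 does not and falls back to Cooley-Tukey.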
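Cooley-Tukey FFT algorithm
DSP book:
$$
y_k = \sum_{j=0}^{n-1} x_j \,\omega_n^{jk}
    = \sum_{j_1=0}^{p-1} \left[ \left( \sum_{j_2=0}^{q-1} x_{p j_2 + j_1}\, \omega_q^{j_2 k_1} \right) \omega_n^{j_1 k_1} \right] \omega_p^{j_1 k_2},
$$
where n = pq and k = k_1 + q k_2.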
OCaml code:
    let cooley_tukey n p q x =
      let inner j1 = fftgen q
        (fun j2 -> x (p * j2 + j1)) in
      let twiddle k1 j1 =
        (omega n (j1 * k1)) @* (inner j1 k1) in
      let outer k1 = fftgen p (twiddle k1) in
      (fun k -> outer (k mod q) (k / q))
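To see the decomposition run, here is a numeric variant that works on `Complex.t` values instead of dag nodes. It is a sketch under assumptions: the sign convention in `omega`, the direct `dft` base case, and the factor choice in this toy `fftgen` are all illustrative and are not taken from genfft.

```ocaml
(* omega n e = exp (-2 pi i e / n); the sign convention is an assumption. *)
let omega n e =
  let theta = -2.0 *. Float.pi *. float_of_int e /. float_of_int n in
  { Complex.re = cos theta; im = sin theta }

(* Direct O(n^2) transform of the sequence x, used as the base case. *)
let dft n x k =
  let acc = ref Complex.zero in
  for j = 0 to n - 1 do
    acc := Complex.add !acc (Complex.mul (x j) (omega n (j * k)))
  done;
  !acc

(* Smallest factor p >= 2 of n; returns n itself when n is prime or 1. *)
let smallest_factor n =
  let rec go p =
    if p * p > n then n else if n mod p = 0 then p else go (p + 1)
  in
  go 2

(* Toy dispatcher: split composite sizes, fall back to the direct sum. *)
let rec fftgen n x =
  let p = smallest_factor n in
  if p = n then dft n x else cooley_tukey n p (n / p) x

and cooley_tukey n p q x =
  let inner j1 = fftgen q (fun j2 -> x (p * j2 + j1)) in
  let twiddle k1 j1 = Complex.mul (omega n (j1 * k1)) (inner j1 k1) in
  let outer k1 = fftgen p (twiddle k1) in
  fun k -> outer (k mod q) (k / q)
```

This version recomputes inner transforms on every call, so it is slow. That does not matter for genfft, where the same recursion is evaluated symbolically once and the resulting dag is what gets optimized.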
Outline
• Dag creation.
• The simplifier.
• Register-oblivious scheduling.
• Related work.
• Conclusion.
The simplifier
Well-known optimizations:
• Algebraic simplification, e.g., a + 0 = a.
• Constant folding.
• Common-subexpression elimination. (In the current implementation, CSE is encapsulated into a monad.)
Specific to the FFT:
• (Elimination of negative constants.)
• Network transposition.
Why did I implement these well-known optimizations, which are implemented in C compilers anyway?
Because the FFT-specific optimizations can only be done after the well-known optimizations.
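Two of the well-known optimizations, algebraic simplification and constant folding, can be sketched with "smart constructors" that simplify as they build nodes. This is an illustrative sketch on a simplified node type (floats and strings are assumptions), not genfft's simplifier. CSE and the FFT-specific steps are left out.

```ocaml
(* Simplified, hypothetical node type. *)
type node =
  | Num of float
  | Load of string
  | Plus of node list
  | Times of node * node

(* Product: fold constants, and apply 0 * a = 0 and 1 * a = a. *)
let times a b =
  match a, b with
  | Num x, Num y -> Num (x *. y)
  | Num 0.0, _ | _, Num 0.0 -> Num 0.0
  | Num 1.0, n | n, Num 1.0 -> n
  | _ -> Times (a, b)

(* Sum: flatten nested sums, fold all constants into one, drop a zero. *)
let plus terms =
  let flat =
    List.concat (List.map (function Plus ns -> ns | n -> [n]) terms) in
  let consts, rest =
    List.partition (function Num _ -> true | _ -> false) flat in
  let c =
    List.fold_left
      (fun acc n -> match n with Num x -> acc +. x | _ -> acc)
      0.0 consts
  in
  match (if c = 0.0 then rest else Num c :: rest) with
  | [] -> Num 0.0
  | [n] -> n
  | ns -> Plus ns

(* Rebuild a dag bottom-up through the smart constructors. *)
let rec simplify = function
  | Plus ns -> plus (List.map simplify ns)
  | Times (a, b) -> times (simplify a) (simplify b)
  | n -> n
```

For instance, `simplify (Plus [Times (Num 1.0, Load "a"); Num 0.0])` reduces to `Load "a"`.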
Network transposition
A network is a dag that computes a linear function [Crochiere and Oppenheim, 1975]. (Recall that the FFT is linear.)
Reversing all edges yields a transposed network.
[Figure: a network on the nodes x, y, s, t, and the transposed network obtained by reversing its edges.]
If a network computes X = AY, the transposed network computes Y = A^T X.
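The transposition property can be checked on a small example. In this sketch a network is a list of weighted edges between numbered nodes; the record type and the functions `transpose`, `gain`, and `matrix` are hypothetical names introduced for illustration.

```ocaml
(* Hypothetical encoding of a linear network as a weighted dag. *)
type network = {
  edges : (int * int * float) list;  (* (source, target, weight) *)
  inputs : int list;
  outputs : int list;
}

(* Reverse every edge and swap the roles of inputs and outputs. *)
let transpose net = {
  edges = List.map (fun (s, t, w) -> (t, s, w)) net.edges;
  inputs = net.outputs;
  outputs = net.inputs;
}

(* Sum, over all paths from u to v, of the product of edge weights.
   Terminates because the network is acyclic. *)
let rec gain net u v =
  if u = v then 1.0
  else
    List.fold_left
      (fun acc (s, t, w) -> if s = u then acc +. w *. gain net t v else acc)
      0.0 net.edges

(* Entry (i, j) is the gain from the j-th input to the i-th output. *)
let matrix net =
  Array.of_list
    (List.map
       (fun o -> Array.of_list (List.map (fun i -> gain net i o) net.inputs))
       net.outputs)

(* Example: inputs 0 and 1, one internal node 2, outputs 3 and 4. *)
let example = {
  edges = [ (0, 2, 1.0); (1, 2, 2.0); (2, 3, 3.0); (2, 4, 1.0); (1, 4, 5.0) ];
  inputs = [0; 1];
  outputs = [3; 4];
}
```

Here `matrix example` has rows (3, 6) and (1, 7), and `matrix (transpose example)` has rows (3, 1) and (6, 7), which is the transposed matrix.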
Optimize the transposed dag!
genfft employs this dag optimization algorithm:
    OPTIMIZE(G) ≡
      G_1 := SIMPLIFY(G)
      G_2^T := SIMPLIFY(G_1^T)
      RETURN SIMPLIFY(G_2)
(Iterating the transposition has no effect.)
Benefits of network transposition

                      genfft                                    literature
size   adds   muls (original)   muls (transposed)   savings   adds   muls
complex to complex
5      32     16                12                  25%       34     10
10     84     32                24                  25%       88     20
13     176    88                68                  23%       214    76
15     156    68                56                  18%       162    50
real to complex
5      12     8                 6                   25%       17     5
10     34     16                12                  25%       ?      ?
13     76     44                34                  23%       ?      ?
15     64     31                25                  19%       67     25
Outline
• Dag creation.
• The simplifier.
• Register-oblivious scheduling.
• Related work.
• Conclusion.
How to help register allocators
genfft’s scheduler reorders the dag so that the register allocator of the C compiler can minimize the number of register spills.
genfft’s recursive partitioning heuristic is asymptotically optimal for n = 2^k.
genfft’s schedule is register-oblivious: the schedule does not depend on the number R of registers, and yet it is optimal for all R.
Certain C compilers insist on imposing their own schedule. With egcs/SPARC, FFTW is 50-100% faster if I use -fno-schedule-insns.
(How do I persuade the SGI compiler to respect my schedule?)
Recursive partitioning
The scheduler partitions the dag by cutting it in the “middle.” This operation creates connected components that are scheduled recursively.
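The idea can be illustrated on the radix-2 butterfly dag with n = 2^k points. This is a sketch under assumptions and not genfft's scheduler, which works on arbitrary dags: here a node is a pair (level, index), level 0 holds the inputs, and node (l, i) depends on nodes (l-1, i) and (l-1, i xor 2^(l-1)). The names `schedule` and `is_valid` are hypothetical.

```ocaml
(* Order the nodes of the 2^k-point butterfly dag by cutting the range of
   levels in the middle and scheduling each connected component in turn. *)
let schedule k =
  let out = ref [] in
  let emit level i = out := (level, i) :: !out in
  (* Schedule levels lo+1 .. hi for the component whose indices agree
     with [base] outside bits lo .. hi-1. *)
  let rec go lo hi base =
    if hi - lo = 1 then begin
      emit hi base;
      emit hi (base lor (1 lsl lo))
    end else begin
      let mid = (lo + hi) / 2 in
      (* Below the cut: one component per setting of bits mid .. hi-1. *)
      for c = 0 to (1 lsl (hi - mid)) - 1 do
        go lo mid (base lor (c lsl mid))
      done;
      (* Above the cut: one component per setting of bits lo .. mid-1. *)
      for c = 0 to (1 lsl (mid - lo)) - 1 do
        go mid hi (base lor (c lsl lo))
      done
    end
  in
  if k > 0 then go 0 k 0;
  List.rev !out

(* Check that every node appears after both of its predecessors. *)
let is_valid k order =
  let n = 1 lsl k in
  let computed = Array.make_matrix (k + 1) n false in
  for i = 0 to n - 1 do computed.(0).(i) <- true done;
  List.for_all
    (fun (l, i) ->
       let ok =
         computed.(l - 1).(i) && computed.(l - 1).(i lxor (1 lsl (l - 1))) in
       computed.(l).(i) <- true;
       ok)
    order
```

Each recursive call touches only the nodes of one component, so a component that is small enough to fit in the registers is finished before the schedule moves on. The order never mentions the number of registers.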
Register-oblivious scheduling
Assume that we want to compute the FFT of n points, and a computer has R registers, where n > R.
Lower bound [Hong and Kung, 1981]. If n = 2^k, the n-point FFT requires at least Ω(n lg n / lg R) register spills.
Upper bound. If n = 2^k, genfft’s output program for the n-point FFT incurs at most O(n lg n / lg R) register spills.
A more general theory of cache-oblivious algorithms exists [Frigo, Leiserson, Prokop, and Ramachandran, 1999].
Related work
• FOURGEN [Maruhn, 1976]. Written in PL/I, generates FORTRAN transforms of size 2^k.
• [Johnson and Burrus, 1983] automate the design of DFT programs using dynamic programming.
• [Selesnick and Burrus, 1986] generate MATLAB programs for FFTs of certain sizes.
• [Perez and Takaoka, 1987] generated prime-factor FFT programs.
• [Veldhuizen, 1995] uses template metaprogramming to generate C++.
• EXTENT [Gupta et al., 1996] compiles a tensor-product language into FORTRAN.
Conclusion
A domain-specific compiler is a valuable tool.
• For the FFT on modern processors, performance requires straight-line sequences with hundreds of instructions.
• Correctness was surprisingly easy.
• Rapid turnaround. (About 15 minutes to regenerate FFTW.)
• We can implement problem-specific code improvements.
• genfft derives new algorithms.