
Journal of Mathematical Modelling and Algorithms 1: 193–214, 2002. © 2002 Kluwer Academic Publishers. Printed in the Netherlands.

    Four Easy Ways to a Faster FFT

LENORE R. MULLIN and SHARON G. SMALL
Computer Science Department, University at Albany, SUNY, Albany, NY 12222, U.S.A.
e-mail: {lenore, small}@cs.albany.edu

(Received: 25 September 2001; in final form: 21 May 2002)

Abstract. The Fast Fourier Transform (FFT) was named one of the Top Ten algorithms of the 20th century, and continues to be a focus of current research. A problem with currently used FFT packages is that they require large, finely tuned, machine-specific libraries, produced by highly skilled software developers. Therefore, these packages fail to perform well across a variety of architectures. Furthermore, many require repeated experiments in order to re-program their code for optimal performance on a given machine's underlying hardware. Finally, it is difficult to know which radix to use given a particular vector size and machine configuration. We propose the use of monolithic array analysis as a way to remove the constraints imposed on performance by a machine's underlying hardware, by pre-optimizing array access patterns. In doing this we arrive at a single optimized program. We have achieved up to a 99.6% increase in performance, and the ability to run vectors up to 8 388 608 elements larger, on our experimental platforms. Preliminary experiments indicate different radices perform better relative to a machine's underlying architecture.

Mathematics Subject Classifications (2000): 18C99, 68Q99, 55P55, 42A99, 18B99, 20J99.

Key words: performance analysis and evaluation, signal and image processing, FFT, array algebra, linear transformations, radix n, MOA, Psi calculus, shape polymorphism, shapes.

    1. Introduction

Using our approach we have provided a new, highly competitive version of the FFT, which performs well for all input vectors, without any knowledge of the underlying architecture or the use of heuristics. Our method is based upon successive refinements, often involving transformations which use an associated array and index calculi, leading from the algebraic specification of the problem.

In general, our results indicate our approach can be beneficial to algorithms which have a common underlying algebraic structure and semantics. The FFT is an example of this type of algorithm and is the focus of our initial investigations.

We begin by discussing related work. We then describe the FFT, a Mathematics of Arrays (MOA), and the Psi Calculus. We show how monolithic array analysis via MOA creates the Three Easy Steps [33, 34], and how it can be used to extend the steps to include a Fourth Step, the general radix FFT. Through experimentation on

    This work was partially supported by NSF CCR-9596184.


High Performance Computing (HPC) machines, we show how our approach outperforms current competitors. Finally, we discuss the direction of future research.

Several authors [12, 41, 42] have developed algebraic frameworks for a number of these transforms and their discrete versions. Using their framework, many variant algorithms for the discrete versions of these transforms can be viewed as computing tensor products, using very similar data flow. Due to the importance of the FFT, numerous attempts have been made to optimize both performance and space [10, 14, 16, 18].

There are various commercial and vendor-specific libraries which include the FFT. The Numerical Analysis Group (NAG) and Visual Numerics (IMSL) provide and support finely tuned scientific libraries specific to various HPC platforms, e.g. SGI and IBM. The SGI and IBM libraries, SCSL and ESSL, are even more highly tuned, due to their knowledge of proprietary information specific to their respective architectures. Another package, FFTW, has proven to be our leading competitor. Consequently, we have chosen to include all these packages in our comparative benchmark studies.

Speed-ups due to parallelism have also been investigated [2, 9, 15, 17, 28]. Language and compiler optimizations enable programs, in general, to obtain speed-ups. ZPL [5–7, 25], and C++ templates [44, 45], represent such features. The Blitz++ system [46] provides generic C++ array objects, using expression templates [45] to generate efficient code for array expressions. This system performs many optimizations, including tiling and stride optimizations. C++ templates are also used in POOMA, POOMA II [21, 23], and the Matrix Template Library (MTL) [26, 27]. MTL is a system that handles dense and sparse matrices, and uses template metaprograms to generate tiled algorithms for dense matrices.

Another approach for automatic generation and optimization of code to match the memory hierarchy and exploit pipelining is exemplified in ATLAS [47], which optimizes code for matrix multiply. In generating code, the ATLAS optimizer uses timings to determine blocking and loop unrolling factors. Another system that generates matrix-multiply code matched to the target architecture is PhiPAC [3]. HPF [20], MPI and OpenMP [1] use decomposition strategies to give the programmer mechanisms to partition the problem based on their understanding of an algorithm's memory access patterns. Skjellum [24, 39] uses polyalgorithms to achieve optimizations across varying architectures.

The Fourier Transform is based on Gauss' theory that any periodic wave may be represented as a summation of a number of sine and cosine waves of different frequencies. The Fourier Transform computes the Fourier coefficients needed for this summation. This transform takes samples of a signal over time and produces the Fourier coefficients, allowing us to create a signal's representation over a frequency variable. In its original formulation the Discrete Fourier Transform (DFT) has O(n^2) complexity. By analyzing the sparse matrix multiplication in the DFT and realizing that only the diagonals of the blocks are used to decompose the matrices, a faster


O(n log n) formulation is possible, i.e. the FFT. The radix 2 Cooley-Tukey was the first fast formulation using this approach.
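The complexity contrast can be made concrete with a short sketch (ours, not the authors' code; the names dft and fft are illustrative), comparing the O(n^2) definition with the O(n log n) radix 2 recursion:

```python
import cmath

def dft(x):
    """Naive DFT from the definition: O(n^2) complex multiply-adds."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x):
    """Radix 2 Cooley-Tukey, O(n log n); n must be a power of two."""
    n = len(x)
    if n == 1:
        return x[:]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w           # butterfly: upper half
        out[k + n // 2] = even[k] - w  # butterfly: lower half
    return out
```

For any power-of-two length the two agree to rounding error, while the operation counts differ asymptotically.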

We use A Mathematics of Arrays (MOA) [29, 35–37], which relies on the Psi Calculus as a model to provide a mathematical framework for arrays, array indexing, and array operations, to create our optimized code. The Psi Calculus includes rules, called Psi Reduction Rules, for transforming array expressions while preserving semantic equivalence. We used this array index calculus to control program transformations in the development of our FFT algorithm.

    2. Radix 2 FFT Implementation

    STEP ONE: ALGORITHM DEFINITION

We began with Van Loan's [43] high-level MATLAB program for the radix 2 FFT, shown in Figure 1. This is a single-loop program, with high-level array/vector operations and reshaping.

In Line 1, P_n is a permutation matrix that performs the bit-reversal permutation on the n elements of vector x. In Line 6, the n-element array x is regarded as being reshaped to an L × r matrix consisting of r columns, each of which is a vector of L elements. Line 6 can be viewed as treating each column of this matrix as a pair of vectors, each with L/2 elements, and doing a butterfly computation that combines the two vectors in each pair to produce a vector with L elements.

The reshaping of the data matrix x in Line 6 is column-wise, so that each time Line 6 is executed, each pair of adjacent columns of the preceding matrix is concatenated to produce a column of the new matrix. The butterfly computation, corresponding to multiplication of the data matrix x by the weight matrix B_L,

Input: x ∈ C^n and n = 2^t, where t ≥ 0 is an integer.
Output: The FFT of x.

    x ← P_n x                       (1)
    for q = 1 to t                  (2)
    begin                           (3)
        L ← 2^q                     (4)
        r ← n/L                     (5)
        x_{L×r} ← B_L x_{L×r}       (6)
    end                             (7)

Here, P_n is an n × n permutation matrix,

    B_L = [ I_{L*}    Ω_{L*} ]
          [ I_{L*}   −Ω_{L*} ],

L* = L/2, and Ω_{L*} is a diagonal matrix with values 1, ω_L, ..., ω_L^{L*−1} along the diagonal, where ω_L is the Lth root of unity.

Figure 1. High-level program for the radix 2 FFT.

    MATLAB is commonly used in the scientific community as a high-level prototyping language.


combines each pair of L/2-element column vectors from the old matrix into a new L-element vector of values for each new column.
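The whole loop of Figure 1 can be transcribed into executable form as a sketch (ours, in Python, under the usual Cooley-Tukey conventions; the names fft_figure1 and bit_reverse_permute are ours, and B_L is applied implicitly rather than materialized):

```python
import cmath

def bit_reverse_permute(x):
    """Apply P_n: reorder x by bit-reversed index (n a power of two)."""
    n = len(x)
    t = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f'0{t}b')[::-1], 2) if t > 0 else 0
        out[rev] = x[i]
    return out

def fft_figure1(x):
    """Iterative radix 2 FFT mirroring Figure 1: x <- P_n x, then for
    q = 1..t combine pairs of L/2-vectors with the butterfly of B_L."""
    x = bit_reverse_permute(x)
    n = len(x)
    L = 2
    while L <= n:                           # one pass per q = 1..t
        half = L // 2                       # L* = L/2
        wL = cmath.exp(-2j * cmath.pi / L)  # omega_L, the Lth root of unity
        for col in range(0, n, L):          # each L-element column of x_{L x r}
            for row in range(half):
                w = wL ** row * x[col + row + half]
                x[col + row + half] = x[col + row] - w
                x[col + row] = x[col + row] + w
        L *= 2
    return x
```

The sketch agrees with the DFT definition for power-of-two lengths; it performs no data movement for the reshapes, in the spirit of Step Two below.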

By direct inspection of Figure 1, the following three basic modules for the radix 2 FFT can be identified.

(1) the computation of the bit-reversal permutation (Line 1),
(2) the computation of the complex weights occurring in the matrices Ω_{L*}. (Note: since L = 2^q, the array B_L is different for each iteration of the loop in Figure 1. Accordingly, the specification of I_{L*} and Ω_{L*} is parameterized by L (and hence implicitly by q).)
(3) the butterfly computation that, using the weights, combines two vectors from x, producing a vector with twice as many elements (Line 6).

    A fourth basic module can be envisioned also:

(4) the generalization of the radix from two to n.

Generalizations of these three modules are appropriate components of arbitrary-size FFTs. More generally, analogues of these modules are appropriate for the wider class of FFT-like transforms [12].

    STEP TWO: TUNING THE MATRIX AND DATA VECTOR

The weight constant matrix B_L is a large two-dimensional sparse array, implying that most of the matrix multiply is not needed. Alternatively, this computation can be naturally specified by using constructors from the array calculus with a few dense one-dimensional vectors as the basis. This yields a succinct description of how B_L can be constructed via a composition of array operations. However, we never actually materialize B_L as a dense L × L matrix. Rather, the succinct representation is used to guide the generation of code for the FFT, and the generated code only has operations where multiplication by a nonzero element of B_L is involved.

Note also, at the top level, B_L is constructed by the appropriate concatenation of four submatrices. Each of these four submatrices is a diagonal matrix. For each of these diagonal matrices, the values along the diagonal are successive powers of a given scalar. However, only the construction of two of these matrices, namely I_{L*} and Ω_{L*}, need be independently specified. Matrix I_{L*} occurs twice in the decomposition of B_L. Matrix −Ω_{L*} can be obtained from matrix Ω_{L*} by applying element-wise unary minus.

There is an MOA constructor that produces a vector, which we denote as weight, whose elements are consecutive multiples or powers of a given scalar value.
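Such a constructor can be sketched in a few lines (our illustrative weight_vector, not MOA syntax):

```python
import cmath

def weight_vector(scalar, count):
    """Sketch of the MOA-style constructor: a vector of consecutive powers
    of a scalar. For the FFT, scalar = omega_L (the Lth root of unity)
    and count = L/2, giving the diagonal of Omega_{L*}."""
    out, acc = [], 1 + 0j
    for _ in range(count):
        out.append(acc)
        acc *= scalar  # one multiply per element, no repeated exponentiation
    return out
```

Building the vector by repeated multiplication avoids recomputing each power from scratch inside the butterfly loops.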

A direct and easily automated scalarization of the matrix multiplication in Line 6 of Figure 1 produced code similar to that given in Figure 2. Here weight is a vector of powers of ω_L. Because the scalarized code does an assignment to one

The displayed code uses simplified indexing, i.e. the data array was not reshaped for each value of q.

do col = 0,r-1
  do row = 0,L-1
    if (row < L/2) then
      xx(row,col) = x(row,col) + weight(row)*x(row+L/2,col)
    else
      xx(row,col) = x(row-L/2,col) - weight(row-L/2)*x(row,col)
    end if
  end do
end do

    Figure 2. Direct scalarization of the matrix multiplication.
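The loop nest of Figure 2 translates directly; the sketch below (ours, in Python) applies one butterfly stage to an L × r data matrix using only the weight vector, never a dense B_L:

```python
import cmath

def butterfly_stage(x2d, L):
    """Figure 2 as Python: combine each column's pair of L/2-vectors
    using the weight vector in place of the sparse matrix B_L."""
    r = len(x2d[0])
    half = L // 2
    wL = cmath.exp(-2j * cmath.pi / L)
    weight = [wL ** k for k in range(half)]  # diagonal of Omega_{L*}
    xx = [[0j] * r for _ in range(L)]        # temporary array, as in the text
    for col in range(r):
        for row in range(L):
            if row < half:
                xx[row][col] = x2d[row][col] + weight[row] * x2d[row + half][col]
            else:
                xx[row][col] = x2d[row - half][col] - weight[row - half] * x2d[row][col]
    return xx
```

Each output element is a sum of only two terms, reflecting the two nonzero entries in the corresponding row of B_L.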

do col = 0,r-1
  do row = 0,L-1
    if (row < L/2) then
      xx(L*col+row) = x(L*col+row) + weight(row)*x(L*col+row+L/2)
    else
      xx(L*col+row) = x(L*col+row-L/2) - weight(row-L/2)*x(L*col+row)
    end if
  end do
end do

    Figure 3. Scalarized code with data stored in a vector.

element at a time, the generated code computes the result of the array multiplication using a temporary array xx, as shown in Figure 2, and then copies xx into x.

The sparseness of matrix B_L is reflected in the assignments to xx(row,col), as follows. The right-hand side of these assignment statements is the sum of only the two terms that correspond to the two nonzero elements of the row of B_L involved in the computation of the left-hand side. The multiplication by the constant 1 from the matrix I_{L*} is done implicitly in the generated code.

The data array x in Figure 1 is a two-dimensional array that is reshaped during each iteration. Computationally, however, it is more efficient to store the data values in x as a one-dimensional vector, and to avoid reshaping completely. MOA's index calculus easily automates such transformations as the mapping of indices of an element of the two-dimensional array into the index of the corresponding element of the vector.

The L × r matrix x_{L×r} is stored as an n-element vector, which we denote as x. This is done in column-major order, reflecting the column-wise reshaping occurring in Figure 1. Thus, element x_{L×r}(row, col) of x_{L×r} corresponds to element x(L·col + row) of x. Consequently, when L changes, and matrix x_{L×r} is reshaped, no movement of data elements of vector x actually occurs.
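The correspondence x_{L×r}(row, col) ↔ x(L·col + row) is easy to check mechanically (a small sketch; the helper name is ours):

```python
def to_vector_index(row, col, L):
    """Column-major mapping: element (row, col) of the L x r matrix view
    of x lives at offset L*col + row in the flat n-element vector."""
    return L * col + row

# Reshaping for successive values of L moves no data: every matrix view
# resolves into the same flat vector x, and reading any view column by
# column recovers x unchanged.
n = 8
x = list(range(n))
for L in (2, 4, 8):
    r = n // L
    readback = [x[to_vector_index(row, col, L)]
                for col in range(r) for row in range(L)]
    assert readback == x  # column-major read-back is the identity
```

This is exactly why the per-iteration reshape in Figure 1 costs nothing once x is stored as a vector.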

Generating the same scalarized code as above, but with each access to an element of the two-dimensional matrix x_{L×r} replaced by an access to the corresponding element of the vector x, produces the code shown in Figure 3.

However, we scalarized the code by having the outer loop variable iterate over the starting index of each column in vector x, using the appropriate stride to increment this loop variable. Thus, instead of using a loop variable col, which ranges from 0 to r−1 with a stride of 1, we use a variable, say col', such that col' = L·col, and which consequently has a stride of L. By doing this, we eliminate the multiplication L·col that occurs each time an element of the arrays xx or x is accessed.

This form of scalarization produced the code shown in Figure 4.

do col = 0,n-1,L
  do row = 0,L-1
    if (row < L/2) then
      xx(col+row) = x(col+row) + weight(row)*x(col+row+L/2)
    else
      xx(col+row) = x(col+row-L/2) - weight(row-L/2)*x(col+row)
    end if
  end do
end do

Figure 4. Striding through the data vector.

do col = 0,r-1
  do row = 0,L/2-1
    do group = 0,1
      if (group == 0) then
        xx(row,group,col) = x(row,group,col) + weight(row)*x(row,group+1,col)
      else
        xx(row,group,col) = x(row,group-1,col) - weight(row)*x(row,group,col)
      end if
    end do
  end do
end do

Figure 5. Retiled loop.
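The strided form of Figure 4 can be sketched in Python (our transcription; stage_strided is an illustrative name) and checked against the L·col indexing of Figure 3:

```python
import cmath

def stage_strided(x, L):
    """Figure 4 style: stride the outer loop by L over the flat vector,
    so the index L*col + row becomes col' + row with no multiplication."""
    n = len(x)
    half = L // 2
    wL = cmath.exp(-2j * cmath.pi / L)
    weight = [wL ** k for k in range(half)]
    xx = [0j] * n
    for col in range(0, n, L):  # col' = L * col, stride L
        for row in range(L):
            if row < half:
                xx[col + row] = x[col + row] + weight[row] * x[col + row + half]
            else:
                xx[col + row] = x[col + row - half] - weight[row - half] * x[col + row]
    return xx
```

The strided and unstrided forms touch the same elements in the same order; only the index arithmetic changes.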

    STEP THREE: TUNING THE CODE

The use of the conditional statement that tests row < L/2 in the above code can be eliminated automatically, as follows. We first re-tile the loop structure so that the innermost loop iterates over each pair of data elements that participate in each butterfly combination. To accomplish this re-tiling, we envision reshaping the two-dimensional array x_{L×r} into a three-dimensional array x_{L/2×2×r}. Under this reshaping, element x_{L×r}(row, col) corresponds to element x_{L/2×2×r}(row mod L/2, row/(L/2), col). The middle dimension of x_{L/2×2×r} splits each column of x_{L×r} into the upper and lower parts of the column. If we were to scalarize Line 6 of Figure 1 based on the three-dimensional array x_{L/2×2×r}, and index over the third dimension in the outer loop, over the first dimension in the middle loop, and over

do col = 0,r-1
  do row = 0,L/2-1
    xx(row,0,col) =
      x(row,0,col) + weight(row)*x(row,1,col)
    xx(row,1,col) =
      x(row,0,col) - weight(row)*x(row,1,col)
  end do
end do

Figure 6....
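The branch-free stage of Figures 5 and 6 can be sketched on the flat vector (our transcription; element (row, group, col) of x_{L/2×2×r} maps to flat index col·L + group·(L/2) + row):

```python
import cmath

def stage_retiled(x, L):
    """Figures 5/6 style: view each L-column as an (L/2) x 2 block and
    write both butterfly outputs per iteration -- no row < L/2 test."""
    n = len(x)
    half = L // 2
    wL = cmath.exp(-2j * cmath.pi / L)
    weight = [wL ** k for k in range(half)]
    xx = [0j] * n
    for col in range(0, n, L):
        for row in range(half):
            top = x[col + row]         # x(row, 0, col)
            bot = x[col + row + half]  # x(row, 1, col)
            xx[col + row] = top + weight[row] * bot
            xx[col + row + half] = top - weight[row] * bot
    return xx
```

The conditional and branch-free forms compute identical stages; the retiled loop simply emits both halves of each butterfly in one iteration.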