Parallel Systolic FFT Architectures


ClariPhy Confidential

Parallel Systolic FFT Architectures for High-Speed, High-Throughput Frequency-Domain Filtering

October 12, 2012

Oscar E. Agazzi



Overview

Introduction

Systolic FFT architecture (radix 2)

Parallel systolic architectures

Storage requirements

Other considerations

Conclusions




Introduction (1)

In this presentation we investigate high-speed, high-throughput architectures for FFTs

The main problem we address is how to simplify the complex interconnection pattern that results from butterflies in FFT implementations derived (directly or indirectly) from FFT flow diagrams

Systolic architectures greatly simplify the interconnections, at the expense of increased storage requirements

Systolic architectures per se may not be sufficient to achieve the throughput and speed required by the BCD filter in the CL10010

Systolic architectures may need to be combined with parallel processing and some degree of traditional, butterfly-based architectures




Introduction (2)

The work presented here is largely based on the systolic FFT architecture described in reference [1]; however, no good references have been found on how to combine systolic implementations with parallel processing

The approach presented here may be similar to the one described in [2], but that reference is not explicit enough to replicate its work

For simplicity, in this presentation we consider only radix-2 FFTs; however, additional savings may be achieved by using higher-radix FFTs

O'Leary [1] reports that savings may be achieved by using radix-4 transforms




Systolic FFT Architecture (radix 2)

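[Figure: block diagram of the radix-2 systolic FFT, example for N=8. Labels: Input 1, Input 2; three cascaded stages with delay elements (Delay 4, Delay 2, Delay 1), each with an adder (+) and a subtractor (-); complex multipliers (X) with twiddle factors W0, W1, W2, W3 and W0, W2; Top Output, Bottom Output.]

[Figure: I/O Timing. Time axis marked N/2, N, 3N/2, 2N, 5N/2; traces labeled BLOCK A, BLOCK B, BLOCK C, BLOCK D, POS A, POS B, POS C, NEG A, NEG B, NEG C.]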

FFT Size    Memory (Complex Words)    Complex Multipliers    Complex Adders
N           ~3N/2                     log2(N)-1              2log2(N)
8           12                        2                      6
16          24                        3                      8
32          48                        4                      10
64          96                        5                      12
128         192                       6                      14

Complexity vs. FFT Size N
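The complexity formulas in the table can be checked with a short script. This is a minimal sketch: the function name `systolic_fft_complexity` is hypothetical, and the memory figure uses the approximate 3N/2 expression from the table as if it were exact.

```python
def systolic_fft_complexity(n):
    """Return (memory words, complex multipliers, complex adders) for a
    radix-2 systolic FFT of size n, using the formulas from the table."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1      # log2(n) for a power of two
    memory = 3 * n // 2              # ~3N/2 complex words (approximation)
    multipliers = stages - 1         # log2(N) - 1
    adders = 2 * stages              # 2 log2(N)
    return memory, multipliers, adders


for size in (8, 16, 32, 64, 128):
    print(size, *systolic_fft_complexity(size))
```

Multipliers and adders grow only logarithmically with N, while memory grows linearly, which is the trade-off the systolic approach accepts in exchange for simple interconnections.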




Discussion

The systolic processor has an extremely simple interconnection pattern

Although memory size grows linearly with N, it is quite manageable for N=64 or even N=128, which are the likely sizes for a parallel/systolic FFT processor for the CL10010 BCD filter

Notice that the processor shown in the previous slide can process two independent FFTs at the same time

The inputs must be skewed in time by N/2 (this requires additional buffering)

The outputs come sequentially (aligning the outputs also requires additional buffering)

The outputs come in bit-reversed order
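To make the output ordering concrete, the sketch below lists the index sequence in which an N-point FFT delivers its outputs when they arrive in bit-reversed order. The helper names are hypothetical, and the code illustrates the index mapping only, not the processor itself.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Indices 0..n-1 listed in bit-reversed order (n is a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice restores natural order, so the same mapping can be used to reorder in either direction.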




FFT Parallelization

We use a numerical example to make the following discussion more concrete

We assume that the FFT size is N=8192 and the desired throughput is 64 GS/s

We also assume that the input comes in blocks of D=128 consecutive samples

Therefore a complete FFT block of 8192 samples can be thought of as a matrix of samples with 64 rows and 128 columns

The FFT processor must accept blocks of 128 samples (where each block is a row of the matrix) at a rate of 500 MHz

The discussion can be easily generalized to other FFT sizes N and decimation factors D
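The arithmetic behind the numerical example can be written out for arbitrary N and D. This is a small sketch; the function name `block_parameters` and its argument names are hypothetical.

```python
def block_parameters(fft_size, block_size, sample_rate_hz):
    """Return (rows, block rate in Hz) for an FFT block arranged as a
    matrix whose rows are input blocks of `block_size` samples."""
    if fft_size % block_size:
        raise ValueError("fft_size must be a multiple of block_size")
    rows = fft_size // block_size                # N / D
    block_rate_hz = sample_rate_hz / block_size  # fs / D
    return rows, block_rate_hz


rows, block_rate_hz = block_parameters(8192, 128, 64e9)
print(rows, block_rate_hz / 1e6)  # 64 rows, 500.0 MHz
```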




FFT Parallelization (cont.)

The parallelization of the FFT is based on the following factorization:

$$X(k) = \sum_{n=0}^{8191} x(n)\, W_{8192}^{nk}, \qquad W_N = \exp\left(-j\,\frac{2\pi}{N}\right)$$

This can be expressed as:

$$X(k) = \sum_{r=0}^{127} W_{8192}^{rk} \sum_{m=0}^{63} x(128m + r)\, W_{64}^{mk} = \sum_{r=0}^{127} W_{8192}^{rk}\, X_r(k)$$

Writing $k = 64p + q$ with $p = 0, \ldots, 127$ and $q = 0, \ldots, 63$, and observing that $X_r(k)$ is periodic in $k$ with period 64, we can write:

$$X(64p + q) = \sum_{r=0}^{127} W_{8192}^{r(64p + q)}\, X_r(q) = \sum_{r=0}^{127} W_{128}^{rp}\, W_{8192}^{rq}\, X_r(q)$$

Finally:

$$X(64p + q) = \mathrm{FFT}_{128}\left\{ W_{8192}^{rq}\, X_r(q) \right\}$$

Where the FFT is taken with respect to index r

The implementation of this factorization is shown in the following slide
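The factorization can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not the hardware algorithm: it uses plain O(N^2) DFTs in place of systolic or butterfly FFTs, assumes the sign convention W_N = exp(-j 2 pi / N), and the names `twiddle`, `dft` and `factored_dft` are hypothetical. Here `cols` plays the role of 128 and `rows = N / cols` the role of 64.

```python
import cmath


def twiddle(n, exponent):
    """W_n ** exponent, with W_n = exp(-j * 2 * pi / n) (assumed sign)."""
    return cmath.exp(-2j * cmath.pi * exponent / n)


def dft(x):
    """Direct DFT, used both as the reference and as each sub-transform."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]


def factored_dft(x, cols):
    """DFT of x computed as leaf DFTs, scalers, and a cols-point output DFT."""
    n = len(x)
    rows = n // cols
    # Leaf stage: X_r(q) is the rows-point DFT over m of x(cols*m + r).
    leaves = [dft([x[cols * m + r] for m in range(rows)]) for r in range(cols)]
    out = [0j] * n
    for q in range(rows):
        # Scaler stage: multiply X_r(q) by W_n^(r*q).
        scaled = [twiddle(n, r * q) * leaves[r][q] for r in range(cols)]
        # Output stage: cols-point DFT over r yields X(rows*p + q).
        spectrum = dft(scaled)
        for p in range(cols):
            out[rows * p + q] = spectrum[p]
    return out


samples = [complex(i % 5, (3 * i) % 7) for i in range(32)]
error = max(abs(a - b) for a, b in zip(dft(samples), factored_dft(samples, 8)))
print(error < 1e-9)  # True
```

With N=32 and `cols=8` the leaf transforms have 4 points each; the slide's case corresponds to N=8192 and `cols=128`, giving 64-point leaves.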




Parallel/Systolic Processor

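[Figure: block diagram of the parallel/systolic processor. The input (fs = 64 GHz) enters a serial-to-parallel converter, which feeds FFT Leaf 0 through FFT Leaf 63; the leaf outputs pass through the scalers into a 128-point FFT. Clock annotations: fs = 64 GHz, fD = 500 MHz. FFT output: 64 blocks of 128 samples each.]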




Discussion

The only complex interconnections in this processor occur in the 128-point output FFT

However, this FFT is relatively small, so its interconnections should not be a problem

By comparison, consider that the BCD filter in the CL4010 uses an FFT size of 512

The FFT required by the processor proposed here is 4 times smaller, and the technology is more advanced than in the CL4010

The processor described here lends itself to an extremely regular and simple layout

The output comes in the form of a matrix of complex numbers with 64 rows and 128 columns, with both columns and rows in bit-reversed order

It is not necessary to reorder them because the IFFT can automatically reverse the order of both rows and columns

Frequency-domain filtering can be implemented in bit-reversed order
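The reason filtering works without reordering is that the filter is a pointwise product, so it commutes with any permutation of the frequency bins as long as the coefficients are stored in the same order. The sketch below shows this for a one-dimensional, 8-bin case; the function names and the test data are hypothetical, and the 64 x 128 matrix case would apply the same idea to rows and columns.

```python
def bit_reversed_order(n):
    """Indices 0..n-1 in bit-reversed order (n is a power of two)."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def filter_in_bit_reversed_order(spectrum_br, response_natural):
    """Multiply a bit-reversed spectrum by a response given in natural order."""
    order = bit_reversed_order(len(spectrum_br))
    # The coefficients are permuted once (this could be done offline).
    response_br = [response_natural[i] for i in order]
    return [s * h for s, h in zip(spectrum_br, response_br)]


order = bit_reversed_order(8)
spectrum = [complex(k, -k) for k in range(8)]        # made-up FFT output
response = [complex(1, 0.5 * k) for k in range(8)]   # made-up filter response
natural = [s * h for s, h in zip(spectrum, response)]
reversed_result = filter_in_bit_reversed_order([spectrum[i] for i in order], response)
print(reversed_result == [natural[i] for i in order])  # True
```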




Hardware Requirements

Hardware Component                                        Number of Units
Memory (Complex Words)                                    10240
Memory (Bits)                                             491520
  (assumes average word length is 24 bits per real or imaginary part)
Complex Multipliers                                       896
Complex Adders                                            1216

Assumptions

Numbers are per polarization and per FFT block

Assuming 2 polarizations and IFFT similar to FFT, numbers in table should be quadrupled

Pipeline registers not included

Output FFT requires (N/2)log2(N) complex multipliers and an equal number of complex adders

Scaler requires 128 complex multipliers
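The multiplier and adder totals in the table can be reproduced from the per-unit formulas. This is a sketch under assumptions: it takes 64 radix-2 systolic leaves of 64 points each (inferred from the factorization and the block diagram), a 128-point output FFT, and a complex word made of two 24-bit parts. The memory word count is copied from the table rather than derived, and all names are hypothetical.

```python
LEAVES = 64                # FFT Leaf 0 .. FFT Leaf 63
LEAF_SIZE = 64             # assumed points per systolic leaf
OUTPUT_FFT_SIZE = 128      # back-end FFT
SCALER_MULTIPLIERS = 128   # from the assumptions list
MEMORY_WORDS = 10240       # taken from the table, not derived here
BITS_PER_PART = 24         # assumed per real or imaginary part


def log2_int(n):
    """log2 of a power of two, as an integer."""
    return n.bit_length() - 1


def hardware_totals():
    leaf_multipliers = log2_int(LEAF_SIZE) - 1    # log2(N) - 1 per leaf
    leaf_adders = 2 * log2_int(LEAF_SIZE)         # 2 log2(N) per leaf
    # Output FFT: (N/2) log2(N) multipliers and the same number of adders.
    output_units = (OUTPUT_FFT_SIZE // 2) * log2_int(OUTPUT_FFT_SIZE)
    return {
        "multipliers": LEAVES * leaf_multipliers + SCALER_MULTIPLIERS + output_units,
        "adders": LEAVES * leaf_adders + output_units,
        "memory_bits": MEMORY_WORDS * 2 * BITS_PER_PART,
    }


print(hardware_totals())
# {'multipliers': 896, 'adders': 1216, 'memory_bits': 491520}
```

Under these assumptions the leaves contribute 320 multipliers and 768 adders, and the output FFT contributes 448 of each, so the back-end FFT and scalers account for most of the multipliers.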




Conclusions

A systolic architecture can considerably simplify the routing of large-block-size, high-throughput, high-speed FFTs

In deep submicron CMOS technologies, interconnections have a large impact on power dissipation; it is therefore important to use regular architectures that lead to an efficient layout and minimize interconnections

In this presentation we have proposed an architecture that has the potential to meet the requirements of the CL10010

However, significant work still needs to be done to explore alternative values of parameters such as DSP clock speed, parallelization factor, size of the front-end FFTs (FFT Leaves) versus size of the back-end FFT, radices different from 2, etc.

It is believed that this work can lead to a very efficient implementation of the BCD filter in the CL10010




References

[1] G. C. O'Leary, "Nonrecursive Digital Filtering Using Cascade Fast Fourier Transformers," IEEE Transactions on Audio and Electroacoustics, Vol. AU-18, No. 2, June 1970, pp. 177-183

[2] P. Jackson et al., "A Systolic FFT Architecture for Real Time FPGA Systems," MIT Lincoln Laboratory publication, September 29, 2004

[3] T. Woodward, private communication

[4] A. V. Oppenheim, Applications of Digital Signal Processing, Prentice Hall, 1978, Chapter 5 (Applications of Digital Signal Processing to Radar)
