7/30/2019 Parallel Systolic FFT Architectures
1/13
ClariPhy Confidential
Parallel Systolic FFT Architectures for High-Speed,
High Throughput Frequency-Domain Filtering
October 12, 2012
Oscar E. Agazzi
Overview
Introduction
Systolic FFT architecture (radix 2)
Parallel systolic architectures
Storage requirements
Other considerations
Conclusions
Introduction (1)
In this presentation we investigate high-speed, high-throughput architectures for FFTs
The main problem we want to address is how to simplify the
complex interconnection pattern resulting from butterflies in FFT
implementations derived (directly or indirectly) from FFT flow diagrams
Systolic architectures greatly simplify the interconnections, at the
expense of increasing the storage requirements
Systolic architectures per se may not be sufficient to achieve the
throughput and speed required by the BCD filter in the CL10010
Systolic architectures may need to be combined with parallel processing
and some degree of traditional, butterfly-based architectures
Introduction (2)
The work presented here is largely based on the systolic FFT architecture described in reference [1]; however, no good references
have been found on how to combine systolic implementations with
parallel processing
The approach presented here may be similar to the one described in [2],
but that reference is not explicit enough to replicate its work
For simplicity, in this presentation we consider only radix 2 FFTs;
however, additional savings may be achieved by using higher radix FFTs
O'Leary [1] reports that savings may be achieved by using radix 4
transforms
Systolic FFT Architecture (radix 2)
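[Figure: block diagram of the radix 2 systolic FFT, example for N=8. Input 1 and Input 2 feed a cascade of three add/subtract (+/-) stages with delay elements Delay 4, Delay 2 and Delay 1. Complex multipliers (X) sit between stages, with twiddle factor labels W0, W1, W2, W3 and W0, W2. The cascade ends in Top Output and Bottom Output.]
[Figure: I/O timing. Time axis marked at N/2, N, 3N/2, 2N, 5N/2. Traces labeled BLOCK A, BLOCK C; BLOCK B, BLOCK D; POS A, POS B, POS C; NEG A, NEG B, NEG C.]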
Complexity vs. FFT Size N

FFT Size   Memory (Complex Words)   Complex Multipliers   Complex Adders
N          ~3N/2                    log2(N)-1             2log2(N)
8          12                       2                     6
16         24                       3                     8
32         48                       4                     10
64         96                       5                     12
128        192                      6                     14
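The formulas in the table can be checked against the listed values with a short script. This is a minimal sketch in Python; the function name `systolic_complexity` is hypothetical and does not come from the presentation.

```python
def systolic_complexity(n):
    """Return (memory_words, complex_multipliers, complex_adders) for an
    N-point radix 2 systolic FFT, using the formulas given in the table."""
    stages = n.bit_length() - 1          # log2(N) for a power of two
    if n < 2 or 2 ** stages != n:
        raise ValueError("n must be a power of two, at least 2")
    memory = 3 * n // 2                  # ~3N/2 complex words
    multipliers = stages - 1             # log2(N) - 1
    adders = 2 * stages                  # 2 log2(N)
    return memory, multipliers, adders

for n in (8, 16, 32, 64, 128):
    print(n, *systolic_complexity(n))
```

Memory grows linearly with N, while the multiplier and adder counts grow only with log2(N).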
Discussion
The systolic processor has an extremely simple interconnection pattern
Although memory size grows linearly with N, it is quite manageable for
N=64 or even N=128, which are the likely sizes for a parallel/systolic
FFT processor for the CL10010 BCD filter
Notice that the processor shown in the previous slide can process two
independent FFTs at the same time
The inputs must be skewed in time by N/2 (this requires additional
buffering)
The outputs come sequentially (aligning the outputs also requires
additional buffering)
The outputs come in bit reverse order
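The bit reverse output order can be illustrated with a short sketch. It assumes the standard bit-reversal permutation of a radix 2 FFT; the helper name `bit_reverse` is hypothetical. For N=8 it lists which natural-order frequency index appears at each output position.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

N = 8
bits = N.bit_length() - 1                # log2(N) = 3
order = [bit_reverse(i, bits) for i in range(N)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice restores natural order, which is why a later stage that also works in bit reverse order can undo it without a separate reordering buffer.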
FFT Parallelization
In the following we use a numerical example to make the discussion more concrete
We assume that the FFT size is N=8192 and the desired throughput is
64 GS/s
We also assume that the input comes in blocks of consecutive samples
of size D=128
Therefore a complete FFT block of 8192 samples can be thought of as a
matrix of samples with 64 rows and 128 columns
The FFT processor must accept blocks of 128 samples (where each block is
a row of the matrix) at a rate of 500 MHz (64 GS/s divided by 128)
The discussion can be easily generalized to other FFT sizes N and
decimation factors D
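The numbers in this example follow from simple arithmetic, sketched below in Python. The variable names are illustrative, and the rate of complete FFTs per second is a derived figure that does not appear on the slide.

```python
N = 8192        # FFT size in samples
D = 128         # samples per input block (decimation factor)
fs = 64e9       # desired throughput in samples per second

rows = N // D                  # blocks (matrix rows) per FFT: 64
block_rate = fs / D            # rate of 128-sample blocks: 500 MHz
fft_rate = block_rate / rows   # complete 8192-point FFTs per second

print(rows, block_rate / 1e6, fft_rate / 1e6)
```

Changing `N` or `D` gives the corresponding matrix shape and DSP clock rate for other FFT sizes and decimation factors.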
FFT Parallelization (cont.)
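The parallelization of the FFT is based on the following factorization:

$$X(k) = \sum_{n=0}^{8191} x(n)\, W_{8192}^{nk}, \qquad W_N = \exp\left(-j\frac{2\pi}{N}\right)$$

This can be expressed as:

$$X(k) = \sum_{r=0}^{127} W_{8192}^{rk} \sum_{m=0}^{63} x(128m+r)\, W_{64}^{mk} = \sum_{r=0}^{127} W_{8192}^{rk}\, X_r(k)$$

Here $n = 128m + r$, and $X_r(k) = \sum_{m=0}^{63} x(128m+r)\, W_{64}^{mk}$ is the 64-point FFT of column $r$ of the sample matrix

Writing $k = 64p + q$ with $p = 0, \ldots, 127$ and $q = 0, \ldots, 63$, and observing
that $X_r(k)$ is periodic in $k$ with period 64, we can write:

$$X(64p+q) = \sum_{r=0}^{127} W_{8192}^{r(64p+q)}\, X_r(q) = \sum_{r=0}^{127} W_{128}^{rp}\, W_{8192}^{rq}\, X_r(q)$$

Finally:

$$X(64p+q) = \mathrm{FFT}_{128}\left\{ W_{8192}^{rq}\, X_r(q) \right\}$$

Where the FFT is taken with respect to index r
The implementation of this factorization is shown in the following slide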
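The factorization above can be checked numerically. The sketch below is a minimal Python illustration, not the hardware algorithm: it uses direct O(N^2) DFTs for both the leaf and output transforms, and a scaled-down size (4 rows by 8 columns instead of 64 by 128) so it runs quickly. The function names `W`, `dft` and `factored_dft` are hypothetical.

```python
import cmath

def W(n, e):
    """Twiddle factor W_n^e with W_n = exp(-j*2*pi/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def dft(x):
    """Direct O(N^2) DFT, used as the reference and as the building block."""
    n = len(x)
    return [sum(x[i] * W(n, i * k) for i in range(n)) for k in range(n)]

def factored_dft(x, rows, cols):
    """DFT of length rows*cols via `cols` leaf DFTs of size `rows`,
    a twiddle scaling, and an output DFT of size `cols` over index r."""
    n = rows * cols
    # Leaf transforms: X_r(q) = sum_m x(cols*m + r) * W_rows^(m*q)
    leaves = [dft([x[cols * m + r] for m in range(rows)]) for r in range(cols)]
    out = [0j] * n
    for q in range(rows):
        # Scalers: multiply X_r(q) by W_n^(r*q)
        scaled = [W(n, r * q) * leaves[r][q] for r in range(cols)]
        # Output transform over r yields X(rows*p + q) for p = 0..cols-1
        column = dft(scaled)
        for p in range(cols):
            out[rows * p + q] = column[p]
    return out

rows, cols = 4, 8                      # scaled-down stand-ins for 64 and 128
x = [complex(i % 5, -(i % 3)) for i in range(rows * cols)]   # toy input
ref = dft(x)
fast = factored_dft(x, rows, cols)
max_err = max(abs(a - b) for a, b in zip(ref, fast))
print(max_err < 1e-9)
```

With `rows=64` and `cols=128` the same structure corresponds to the 64-point leaf FFTs, the scalers, and the 128-point output FFT of the example.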
Parallel/Systolic Processor
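[Figure: block diagram of the parallel/systolic processor. The input at fs=64GHz enters a serial to parallel converter, which feeds FFT Leaf 0 through FFT Leaf 63 at fD=500MHz. The leaf outputs pass through the scalers and then the 128 point FFT. FFT output: 64 blocks of 128 samples each.]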
Discussion
The only complex interconnections in this processor occur in the 128-point output FFT
However, this FFT is relatively small, so its interconnections should not be a
problem
By comparison, consider that the BCD filter in the CL4010 uses an FFT size
of 512
The FFT required by the processor proposed here is 4 times smaller, and
the technology is more advanced than in the CL4010
The processor described here lends itself to an extremely regular and simple
layout
The output comes in the form of a matrix of complex numbers with 64 rows
and 128 columns with both columns and rows in bit reverse order
It is not necessary to reorder them because the IFFT can automatically reverse
the order of both rows and columns
Frequency domain filtering can be implemented in bit reverse order
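The point about filtering in bit reverse order can be sketched as follows: because the filter is an element-wise product, the same permutation can be applied once to the filter coefficients, and the data never needs reordering. This is a toy Python illustration with made-up values; `bit_reverse` and `permute` are hypothetical helpers.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def permute(values):
    """Return `values` reordered so position i holds element bit_reverse(i)."""
    bits = len(values).bit_length() - 1
    return [values[bit_reverse(i, bits)] for i in range(len(values))]

X = [complex(k, 1) for k in range(8)]     # spectrum in natural order (toy data)
H = [complex(1, -k) for k in range(8)]    # filter response in natural order (toy data)

natural = [xk * hk for xk, hk in zip(X, H)]
# Filter directly on bit-reversed data, with H permuted once ahead of time.
reversed_domain = [xk * hk for xk, hk in zip(permute(X), permute(H))]

print(permute(reversed_domain) == natural)
```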
Hardware Requirements
Hardware Component                                              Number of Units
Memory (Complex Words)                                          10240
Memory (Bits) (assumes average word length is 24 bits)          491520
Complex Multipliers                                             896
Complex Adders                                                  1216

Assumptions
Numbers are per polarization and per FFT block
Assuming 2 polarizations and IFFT similar to FFT, numbers in table should be
quadrupled
Pipeline registers not included
Output FFT requires (N/2)log2(N) complex multipliers and equal number of
complex adders
Scaler requires 128 complex multipliers
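The multiplier and adder totals can be reproduced from the per-block counts given earlier. The breakdown below is an inference, not stated on the slide: it assumes 64 leaves, each a 64-point radix 2 systolic FFT with log2(N)-1 multipliers and 2log2(N) adders, plus the scaler and the 128-point output FFT. The memory word count is taken from the table as given, and reading "24 bits" as the length of each real and imaginary part is an assumption that happens to match the bit total.

```python
LEAVES = 64        # FFT Leaf 0 .. FFT Leaf 63
LEAF_SIZE = 64     # assumed size of each systolic leaf FFT
OUT_SIZE = 128     # size of the output FFT

leaf_stages = LEAF_SIZE.bit_length() - 1     # log2(64) = 6
out_stages = OUT_SIZE.bit_length() - 1       # log2(128) = 7

leaf_mults = leaf_stages - 1                 # per leaf: log2(N) - 1
leaf_adds = 2 * leaf_stages                  # per leaf: 2 log2(N)
out_units = (OUT_SIZE // 2) * out_stages     # (N/2) log2(N) for the output FFT
scaler_mults = 128                           # from the assumptions list

multipliers = LEAVES * leaf_mults + scaler_mults + out_units
adders = LEAVES * leaf_adds + out_units
memory_bits = 10240 * 2 * 24                 # words x (real + imag) x 24 bits

print(multipliers, adders, memory_bits)      # 896 1216 491520
```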
Conclusions
A systolic architecture can considerably simplify the routing of large block size, high throughput, high speed FFTs
In deep submicron CMOS technologies, interconnections have a large impact
on the power dissipation, therefore it is important to use regular architectures
that lead to an efficient layout and to minimize interconnections
In this presentation we have proposed an architecture that has the potential to
meet the requirements of the CL10010
However, significant work still needs to be done to explore alternative values of
parameters, such as DSP clock speed, parallelization factor, size of the front-
end FFTs (FFT Leaves) versus size of the back-end FFT, radices different from 2,
etc.
It is believed that this work can lead to a very efficient implementation of the
BCD filter in the CL10010
References
[1] G. C. O'Leary, Nonrecursive Digital Filtering Using Cascade Fast Fourier Transformers, IEEE Transactions on Audio and Electroacoustics, Vol. AU-18, No. 2, June 1970, pp. 177-183
[2] P. Jackson et al., A Systolic FFT Architecture for Real Time FPGA Systems, MIT Lincoln Laboratory publication, September 29, 2004
[3] T. Woodward, private communication
[4] A. V. Oppenheim, Applications of Digital Signal Processing, Prentice Hall, 1978, Chapter 5 (Applications of Digital Signal Processing to Radar)