SY Lecture012 Advanced

A Systolic FFT Architecture for Real Time FPGA Systems

Radar Processing Application8K FFT bottleneckReal-timeComplex0.6 GSPS input (16-bits)1.2 GSPS output (12-bits)ADC1.2 GSPSxy32KCorrelationADC1.2 GSPSI/QFFTFIFOConjugateI/QFFTFIFO++FIFO

Evaluation ScorecardThe design changes will be scored based on the following metrics:Length of FFTButterfliesAdder/subtractorsIO pinsMultipliersShift registers

Size168192Pins???Fly???Mult???Add???Shift???

OutlineIntroduction

Parallel architectureData flow graphEffects of serial input

Systolic architecture

Performance summary

Conclusions

Baseline Parallel Architecture12345678910111213141516123456789101112131415161234567891011121314151612345678910111213141516Parallel FFT Butterfly structureRemoves redundant calculation

Size168192Pins448229KFly3253KMultAddShift00

Complex ButterflyButterfly contains1 complex addition1 complex subtraction 1 complex, constant multiply

uvxy+-

Size168192Pins448229KFly3253KMultAddShift00

Complex AdditionComplex addition adds the real and imaginary parts separately:

2 addsacbd++realimag

Size168192Pins448229KFly3253KMultAdd128213KShift00

Complex MultiplyThe FOIL method of multiplying complex numbers:4 multiplies and 2 addsac-+bdrealimag

Size168192Pins448229KFly3253KMult128213KAdd192320KShift00

Efficient Complex MultiplyAnother approach requires fewer multiplies:3 multiplies and 5 addsab-+cd-+-realimag

Size168192Pins448229KFly3253KMult96159K75%Add288480K150%Shift00

Parallel-Pipelined ArchitectureA pipelined version IO Bound100% Efficient

12345678910111213141516123456789101112131415161234567891011121314151612345678910111213141516

Size168192Pins448229KFly3253KMult96159KAdd288480KShift00

Serial Input12345678910111213141516123456789101112131415161234567891011121314151612345678910111213141516A serial version IO-rate matches A/D6.25% Efficient

Size168192Pins2828.01%Fly3253KMult96159KAdd288480KShift00

OutlineIntroduction

Parallel architecture

Systolic architectureSerial implementationApplication specific optimizations

Performance summary

Conclusions

Serial ArchitectureThe parallel architecture can be collapsedOne butterfly per stageConsumes 1 sample per cycleSame latency and throughputMore efficient design50% EfficiencyStage 1Stage 2Stage 3Stage 4

Size168192Pins2828Fly413.03%Mult1239.03%Add36117.03%Shift2212K

High Level ViewReplace complex structure with an abstract cell which contains:FIFOsButterflySwitch networkStage 1Stage 2Stage 3Stage 4

Size168192Pins2828Fly413Mult1239Add36117Shift2212K

8192-Point ArchitectureRequires 13 stagesFixed point arithmeticVaries the dynamic range to increase accuracyOverflow replaced with saturated value4 int14 frac5 int13 frac6 int12 frac7 int11 frac8 int10 frac9 int9 frac10 int8 fracMultipliers limit design to 18-bits and 150 MHzAchieves 70 dB of accuracy 4 int4 frac0110.0101

Size168192Pins2828Fly413Mult1239Add36117Shift2212K

Increase ParallelismAdd more pipelinesDesign limited to 150 MHz by multipliersI/Q module generate 600 MSPSMeets real-time requirement through parallelism

Size168192Pins112112400%Fly1652400%Mult48156400%Add144468400%Shift1612K100%

SimplificationTarget application allows a specific simplificationPads a 4096-point sequence with 4096 zerosRemoves 1st stage multipliers and addersAchieves 100% efficiency in steady state

Size168192Pins160160143%Fly1652Mult3614492%Add10843292%Shift48K67%

OutlineIntroduction



Performance summaryPower, operations per secondFPGA resources, frequencyLatency, throughput

Conclusions

ResultsThe current implementation has been placed on a Virtex II 8000 and verified at 150 MHz

Power: 22 Watts @ 65 CGOPS: 86 total @ 3.9 GOPS/Watt

FPGA resources (XC2V8000)Multipliers: 144 (85%)LUTs and SRLs: 39,453 (42%)BlockRAM: 56 (33%)Filp flops: 35,861 (38%)

Frequency: 150 MHzLatency: 1127 cyclesThroughput: 1.2 GSPS

OutlineIntroduction



Performance summary

ConclusionsApplicability to other platformsFuture work

ConclusionsCreated a high performance, real-time FFT coreLow power (3.9 GOPS/Watt)High throughput (1.2 GSPS), low latency (7.6 sec/sample)Fixed-point (18-bits), high accuracy (70 dB)

General architectureExtendable to a generic FPGA coreRetargetable to ASIC technology

Future workDevelop a parameterizable IP core generator

SourcesFredrik EdmanPreston Jackson, Cy Chan, Charles Rader, Jonathan Scalera, and Michael VaiHPEC 200429 September 2004

11Here is an example radar application. The radar receives two signals, x and y. The objective is to compute the cross correlation of these signals and then determine the time delay between them. Signals of length 32 K are used to compute the cross correlation. The high sampling rate of 1.2 Giga samples per second presents the main challenge. The box near the bottom shows a blow-up of the correlation function. The main bottleneck is the FFT which must keep up with the 600 MSPS of complex data coming from the I/Q unit.This presentation will begin by presenting the standard, parallel FFT architecture. As modifications are made to meet our application needs, the effects will be noted in the scorecard. We will be able to compare the architectures based on these metrics: number of IO pints, butterfly stages, multipliers, adder/subtractors, and shift registers.This is the parallel FFT architecture. When compared with a direct implementation of the DFT, it removes redundant calculations. The complex butterfly contains 1 multiplication (from the twiddle factor) and an addition and subtraction. These operations are complex, and thus in terms of real operations they are more complicated.A complex addition sums the real part of the two addends to produce the real part of the complex result. Likewise, the two imaginary parts of the addend are summed to produce the imaginary result.The complex multiply requires 4 real multiplies an addition and a subtraction. This is where the majority of the computational load of the system takes place. On a Xilinx FPGA the multipliers are ASIC cores. If we could reduce the number of multiplications, we could potentially fit a larger FFT on the same FPGA. Multipliers and routing limit the size of are the limiting factor.This is a more efficient (in terms of multipliers) implementation of a complex multiplication. The number of real multiplies is reduced to 3 while increasing the number of additions to 5. Since multipliers are the limiting factor, this is a good trade.The parallel implementation is easily pipelined. This architecture would be 100% efficient if it could be supplied with enough data. However, the number of input pins needed to keep an 8192-point FFT would be 65606, with only 8-bit inputs. Also, A/D converters cannot keep up with this rate of data generation. Thus, a serial design is often used to match with the serial input.If we use the same parallel architecture with serial data, we achieve only 6.25% efficiency. This design can be collapsed to improve efficiency.This is the collapsed design for the FFT. Only one butterfly per stage is needed. The routing networks and shift registers ensure that the data are presented to the right butterfly at the right time. It can achieve 50% efficiency in steady state. The only delay comes in waiting for the data at the beginning. This abstraction simplifies the diagram from the previous page. Each stage block contains 1 butterfly, a routing network, and 2 FIFOs. Using this more simple representation we can see the implementation of the 8192-point FFT.This is the high-level view of the 8192-point FFT in serial format. It requires 13 stages to complete the butterfly. Our design uses fixed-point arithmetic while varying the numbers of bits dedicated to the integer and fractional portions (also called Block Floating Point Arithmetic). This method allows us to use simple fixed-point operations and yet in tests of representative data, we are able to achieve 70 dB of accuracy.Our I/Q module generates 600 million complex samples per second. However, the design only operates at 150 MHz. Thus, we can increase the number of simultaneous calculations that occur by increasing the number of pipelines and effectively the amount of parallelism. No dependent data are found in the first 11 stages; after this point, data from separate pipelines must be shared. This architecture allows us to logic to different FPGAs with ease, as long as the last two stages stay on the same FPGA.Another optimization we are able to make is the reduction of the data rate. Our specific application allows for 0s to be placed on the inputs of the FFT. This allows us to achieve 100% efficiency. We could also achieve 100% efficiency by ensuring that the data rate was twice that of the clock rate of the FFT.The performance of this FFT is very good. Of particular note is the 78 GOPS on a single Virtex II FPGA. The latency is constant for any sample set of data and the throughput is 1.200 GSPS.This presentation has shown a high performance FFT which achieves 3.5 GOPS/Watt. It operates in real time and with high accuracy. It is a general architecture which can be retargeted to ASICs or a general purpose FPGA core. In the future we may continue to reduce the logic and routing requirements and improve the operating frequency.

Documents

SY Lecture012 Advanced