CAD Tool Autogeneration of VHDL FFT for

FPGA/ASIC Implementation

Todd E. Schmuland and Mohsin M. Jamali

Department of Electrical Engineering & Computer Science

The University of Toledo

Toledo, OH, 43606

Matthew B. Longbrake and Peter E. Buxa

AFRL/RYDR

Wright-Patterson AFB

Dayton, OH, 45433

Abstract—Hand-coding Fast Fourier Transforms (FFTs) in Hardware Description Language (HDL) is time consuming and prone to errors. Proprietary IP cores are available, however they are closed-source and unviewable. The open-source FFT generator SPIRAL is available, however it only produces parallel arithmetic solutions and thus limits the maximum FFT size that will fit in available Field Programmable Gate Arrays (FPGAs). An autogenerator of VHDL FFTs is described that takes a set of FFT parameters and generates an FFT component with feedback of occupied slices, maximum frequency, and dynamic range performance. Both parallel arithmetic and serial-parallel butterfly architectures can be generated where serial-parallel allows larger sized FFTs to fit inside available FPGA parts. Emphasis is placed on large sized serial-parallel FFTs and portability to Application-Specific Integrated Circuits (ASICs) using Cadence Encounter. Serial-parallel FFT pipeline control and FPGA hardware reduction are also investigated.

Index Terms—FFT; fixed-point; VHDL; FPGA; autogeneration

I. INTRODUCTION

System-on-Chip (SoC) solutions using Field Programmable

Gate Arrays (FPGAs) or Application-Specific Integrated Cir-

cuits (ASICs) are desirable to lower hardware costs and keep

circuit sizes small. The component building blocks used in

an SoC are typically hand-coded in Hardware Description

Language (HDL). This endeavor is both time consuming and

prone to implementation errors. A key component used in

many SoC solutions is the Fast Fourier Transform (FFT) [1].

However, the performance of the FFT component in terms of

occupied slices, maximum frequency, and dynamic range, is

not known until the HDL for the FFT component is coded,

synthesized, and measured. It would be desirable to have a

software tool autogenerate HDL for an FFT component where

an engineer simply provides the targeted characteristics of the

FFT. In addition, the software tool should give feedback to

the engineer on the performance of the autogenerated FFT

component. Using this feedback, the engineer can focus on

the overall SoC performance and make adjustments to the FFT

component as necessary.

This paper describes a software tool, written as a MATLAB function, that allows an FFT to be specified via its programmable/selectable parameters such as input word size, FFT size, Decimation-in-Time (DIT) or Decimation-in-Frequency (DIF), butterfly radix, phase factor (twiddle factor) quantization, and stage scaling. The software tool provides a bit-true MATLAB simulation of the FFT architecture and an option to autogenerate Very-high-speed integrated circuit HDL (VHDL) for vendor-independent FPGA or ASIC implementation. The generated FFT structure is fully parallel, with every butterfly instantiated; however, each butterfly is optimized by leveraging constant multipliers.

This work was sponsored by the Dayton Area Graduate Studies Institute (DAGSI) with support from the Air Force Research Lab, Sensors Directorate.

Unlike commercial FFT IP cores [2]–[4], which are delivered as a black box FFT implementation that conforms to a set of parameters, this software tool generates open source VHDL of parameterized FFTs whose contents are fully exposed and portable to other hardware platforms. The popular SPIRAL FFT generator [5] only produces HDL with parallel arithmetic using DSP blocks, which limits the FFT size that can fit into a given FPGA part. Our software tool, however, gives the option of autogenerating a serial-parallel constant multiplier butterfly architecture, thus allowing larger FFT sizes to fit in available FPGA parts. The targeted use of the resulting VHDL entity is for SoC solutions where the bulk of DSP and RAM blocks [6]–[8] are reserved for components other than the FFT component.

The paper is organized as follows. In Section II, the FFT

parameters the VHDL autogeneration tool accepts as input

are described. Section III discusses the functions, macros, and

entities developed as a VHDL package for constructing FFTs.

The autogeneration of VHDL for parallel arithmetic and serial-

parallel butterfly architectures is covered in Section IV. Section

V looks at the performance characteristics of the autogenerated

VHDL code and the portability of the VHDL code. Section VI

concludes the paper.

II. SOFTWARE DEFINED FFT PARAMETERS

The VHDL autogenerator takes many FFT parameters as

input. This offers great flexibility in designing an FFT com-

ponent for an SoC without having to write a single line of

VHDL code. The FFT parameters entered into the VHDL

autogenerator are:

• Number of input samples

• DIT or DIF

• Butterfly radix of 2, 4, or split (auto-determined)

978-1-4673-0859-5/12/$31.00 ©2012 IEEE

Fig. 1. FFT flowgraph of a 16-point DIT radix-4 phase factor map with radix-4 butterflies constructed using four generalized 2-input butterflies as highlighted by the box. (Flowgraph image not reproduced; inputs x[0] to x[15] enter in natural order, outputs appear in bit-reversed order X[0], X[8], X[4], X[12], X[2], X[10], X[6], X[14], X[1], X[9], X[5], X[13], X[3], X[11], X[7], X[15], and the phase factors on the graph are W0, W1, W2, W3, W4, W6, and W9.)

• Butterfly architecture: serial-parallel or parallel arithmetic

• Input word size

• Scaling between FFT stages (output word size)

• Fixed or variable length phase factors

• Complex multiplier built using 3 or 4 multipliers

The software tool uses one master FFT flowgraph and

determines the correct phase factor map while leveraging

radix-4 butterflies whenever possible. For example, Fig. 1

shows a 16-point DIT with radix-4 butterflies constructed with

four generalized 2-input butterflies. Fully parallel structured

FFTs have constant phase factors per butterfly, therefore the

complex multipliers reduce to a series of adders shifted by

position. Both fixed and variable length phase factors are

offered as choices for autogenerated VHDL.
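The reduction of a constant multiplier to "adders shifted by position" can be illustrated with a minimal Python sketch. It is not the tool's VHDL: the function names `quantize`, `shift_add_terms`, and `constant_multiply` are hypothetical, and a plain binary expansion of the constant is assumed, whereas the paper does not state which encoding the tool uses.

```python
import math


def quantize(value, frac_bits):
    """Round a real constant to an integer with frac_bits fraction bits."""
    return int(round(value * (1 << frac_bits)))


def shift_add_terms(constant):
    """Bit positions set in |constant|; each one is one shifted adder input."""
    magnitude = abs(constant)
    return [pos for pos in range(magnitude.bit_length()) if (magnitude >> pos) & 1]


def constant_multiply(x, constant):
    """Multiply x by an integer constant using only shifts and adds."""
    total = 0
    for pos in shift_add_terms(constant):
        total += x << pos          # one adder per set bit of the constant
    return -total if constant < 0 else total


# cos(pi/4) with 8 fraction bits (an assumed, illustrative word size)
c = quantize(math.cos(math.pi / 4), 8)
print(c, shift_add_terms(c), constant_multiply(100, c))
```

Because the constant is known when the hardware is generated, the set of shifts is fixed and no general-purpose multiplier is needed; a constant with fewer set bits costs fewer adders.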

Butterfly architecture can either be serial-parallel, where

data is operated on one bit at a time, or it can be processed

in parallel as bit vectors. Serial-parallel butterfly architecture

lends itself well to partitioning very large FFTs across several

FPGA parts, since each data path is only one signal/pin.

In addition, the ratio of serial-parallel throughput versus

occupied slices is favorable since every butterfly of the FFT

is instantiated and clocked simultaneously.
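As an illustration of operating on data one bit at a time, the following Python sketch models a bit-serial adder with a one-bit carry register. It is a behavioral model under stated assumptions, not the generated VHDL entity: least-significant-bit-first ordering and two's-complement words are assumed, and the helper names are hypothetical.

```python
def to_bits_lsb_first(value, width):
    """Two's-complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Interpret LSB-first bits as a signed two's-complement integer."""
    unsigned = sum(bit << i for i, bit in enumerate(bits))
    return unsigned - (1 << len(bits)) if bits[-1] else unsigned


def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, producing one sum bit per clock."""
    carry = 0                                  # models the carry flip-flop
    out = []
    for a, b in zip(a_bits, b_bits):           # one loop pass = one clock
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


width = 8
total = serial_add(to_bits_lsb_first(5, width), to_bits_lsb_first(-3, width))
print(from_bits_lsb_first(total))
```

Each operand needs only a single wire, which is why a serial data path costs one signal or pin, at the price of one clock per bit of the word.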

As data flows through an FFT, the word size grows by one

bit per stage due to the add/subtract operations’ carry bit inside

each butterfly. Depending on the application, either the output

can be truncated after the FFT, or the data can be truncated

internal to the FFT, also known as scaling, by truncating one

bit between each FFT stage.
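The word growth and the scaling option can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, truncation is modeled as an arithmetic right shift by one bit, and the 11-bit, 8-stage figures are example values (8 stages corresponds to a 256-point FFT built from 2-input butterflies).

```python
def butterfly(upper, lower, scale):
    """Add/subtract pair; with scale=True one bit is truncated from each result."""
    total, diff = upper + lower, upper - lower
    if scale:
        total, diff = total >> 1, diff >> 1    # drop one bit to hold the word size
    return total, diff


def output_word_size(input_bits, stages, scale):
    """Word size at the FFT output: grows one bit per stage unless scaled."""
    return input_bits if scale else input_bits + stages


def fits_signed(value, bits):
    """True if value is representable as a signed word of the given size."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


print(output_word_size(11, 8, scale=False))    # unscaled word size
print(output_word_size(11, 8, scale=True))     # scaled word size
print(butterfly(1023, 1023, scale=False), butterfly(1023, 1023, scale=True))
```

With the largest 11-bit inputs, the unscaled sum no longer fits in 11 bits, while the scaled sum does; scaling trades one bit of precision per stage for a constant word size.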

III. DEVELOPED VHDL LIBRARY PACKAGE

The VHDL autogenerator uses our developed VHDL library package consisting of functions, macros, and entities to construct the VHDL files for a specified FFT as shown in Fig. 2. Functions were used for the parallel arithmetic since they allow the recursive function calling necessary to construct the constant multipliers which consist of adders shifted by position. Functions also facilitate easier construction of pipelining where each FFT stage is clocked using VHDL processes containing arithmetic operators for each butterfly in a behavioral fashion.

Fig. 2. Autogeneration software tool showing FFT parameter input, flowgraph generation, trivial/non-trivial phase factor determination for each FFT stage, generation of each FFT butterfly, stage scaling, and bit-reversal output with mux/demux wrapper. (Block diagram not reproduced; its blocks include: user supplied input parameters (FFT size 2^n, DIT/DIF, scaling); generate phase factor constants; generate FFT flowgraph (butterfly data paths, phase factor locations); analyze trivial/non-trivial multiplications and determine phase factor case; butterfly arithmetic and phase factor multipliers built from the developed VHDL library (functions, macros, entities) for the parallel and serial paths; stage progression of data with adaptive scaling between stages; timing shift register taps (serial); bit-reversal output signal map; mux/demux entity wrapper.)

Entities were used for the serial-parallel butterfly archi-

tecture since they require a structural model where each

arithmetic operator clocks and processes one bit of data at

a time. The use of entities also allows generics to be used to

provide the constant bit-string (phase factor) to the constant

multiplier entity. A second generic is used to dictate whether

a buffering delay line should be placed on the entity’s output

and how many clocks the output should be delayed by. This

is necessary to keep all the butterflies of a given FFT stage

in sync such that trivial and non-trivial phase factor butterfly

results appear at the next stage with the same latency. A trivial

butterfly is one where the phase factor is either 1+0j or 0−1j,

thus reducing the complex multiply to a simple identity or

complex conjugate with swap operation respectively.
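The two trivial cases can be checked with a short Python sketch. The function name and the string labels for the phase factors are hypothetical; the arithmetic follows from (a + jb)(−j) = b − ja, which is a swap of the real and imaginary parts followed by a conjugation.

```python
def multiply_trivial(re, im, phase):
    """Apply a trivial phase factor without any multiplier."""
    if phase == "1+0j":
        return re, im              # identity: data passes through
    if phase == "0-1j":
        return im, -re             # swap the parts, then negate the imaginary part
    raise ValueError("not a trivial phase factor: " + phase)


# Compare against Python's built-in complex multiplication
x = complex(3, 4)
print(multiply_trivial(3, 4, "0-1j"), x * complex(0, -1))
```

Neither case needs an adder or multiplier, which is why a trivial butterfly finishes sooner than a non-trivial one and needs a delay line to stay aligned with the rest of its stage.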

IV. AUTOGENERATION OF FFT BUTTERFLIES

Fig. 3. Autogeneration of FFT butterflies using a two pass approach where pass one determines the variables required for the butterfly and pass two generates the arithmetic operator statements based on trivial/non-trivial phase factor cases. (Diagram not reproduced; the four cases it lists are summarized below, where WU and WL are the upper and lower phase factors, U1 and L1 are the upper and lower values after the phase factor step, and U2 and L2 are the butterfly outputs.)

Case  WU         WL         U1         L1         Butterfly output
1     1 + 0j     1 + 0j     Pass-thru  Pass-thru  U2: add/add, L2: subtract/subtract
2     1 + 0j     Complex *  Pass-thru  Multiply   round(L1), U2: add/add, L2: subtract/subtract
3     Complex *  Complex *  Multiply   Multiply   U2: round(add/add), L2: round(subtract/subtract)
4     1 + 0j     0 - 1j     Pass-thru  Pass-thru  U2: add/subtract, L2: subtract/add

After the VHDL autogeneration tool has determined the FFT flowgraph and phase factor locations, the tool creates a set of VHDL files that will synthesize into a usable FFT component. The files include the top level entity for the FFT component and individual files for each FFT stage. The top level entity consists of concurrent statements for each FFT stage, plus a bit-reversed signal assignment from the last FFT stage to the FFT output, thus creating a component with in-order input and in-order output signals.
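The bit-reversed signal assignment mentioned above can be sketched as an index map in Python. The function names are hypothetical, and the sketch only computes the permutation; in the generated VHDL this would be a fixed wiring of signals, not a computation.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n_points):
    """Output index found at each position of the last stage (n_points a power of 2)."""
    bits = n_points.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_points)]


print(bit_reversed_order(16))
```

For 16 points this reproduces the output ordering X[0], X[8], X[4], X[12], ... of Fig. 1. Since the permutation is its own inverse, applying it once more restores natural order.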

The tool autogenerates each generalized 2-input butterfly

using a two pass approach as shown in Fig. 3. Pass one

computes the needed variables and bit vector sizes while pass

two generates the function or entity arithmetic operators for the

butterfly. Every butterfly in an FFT falls into one of four cases

depending on the upper and lower phase factors and whether

they are a trivial or non-trivial complex multiplication. The

resulting case determines if the input data is passed through

as is or requires a complex multiplication, and determines

how the phase factor multiplications are combined into the

butterfly’s output.
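A minimal Python sketch of the case selection follows. It assumes the pairing of phase factors read from the case labels of Fig. 3: case 1 when both factors are 1 + 0j, case 2 when only the lower factor is non-trivial, case 3 when both are non-trivial, and case 4 when the upper factor is 1 + 0j and the lower is 0 − 1j. The function name and the floating-point tolerance are hypothetical, and combinations outside these four are rejected, not guessed.

```python
import cmath

ONE = complex(1, 0)
MINUS_J = complex(0, -1)


def is_close(a, b):
    """Compare phase factors with a small tolerance for floating-point error."""
    return cmath.isclose(a, b, abs_tol=1e-12)


def butterfly_case(w_upper, w_lower):
    """Return the butterfly case (1 to 4) for an upper/lower phase factor pair."""
    upper_trivial = is_close(w_upper, ONE) or is_close(w_upper, MINUS_J)
    lower_trivial = is_close(w_lower, ONE) or is_close(w_lower, MINUS_J)
    if is_close(w_upper, ONE) and is_close(w_lower, ONE):
        return 1                   # both pass through
    if is_close(w_upper, ONE) and is_close(w_lower, MINUS_J):
        return 4                   # lower input handled by swap and sign change
    if is_close(w_upper, ONE) and not lower_trivial:
        return 2                   # only the lower input is multiplied
    if not upper_trivial and not lower_trivial:
        return 3                   # both inputs are multiplied
    raise ValueError("phase factor pair not covered by the four cases")


def w(k, n=16):
    """Phase factor W_n^k = exp(-j*2*pi*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


print(butterfly_case(w(0), w(0)), butterfly_case(w(0), w(1)),
      butterfly_case(w(3), w(9)), butterfly_case(w(0), w(4)))
```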

An FFT component using serial-parallel butterfly architec-

ture requires pipeline timing circuitry to control:

1) Timing the data as it passes through the FFT

2) Indicating when each FFT result is ready to be latched

3) Resetting each FFT stage at the correct time

The following equations are used by the software tool

to determine the timing shift register length, tap points for

FFT stage resets, and minimum number of pipeline clocks

required for continuous operation of the FFT component. Each

FFT stage requires a specific number of clocks for its fixed-

point result to appear on its output, with the same fractional

precision, such that

$$L(n) = \begin{cases} 1 & \text{stage } n \text{ is all trivial} \\ T_f + 2 & \text{stage } n \text{ has some non-trivial} \end{cases} \tag{1}$$

where L(n) is the number of clocks required for FFT stage n

to have its result appear on its output and Tf is the number

of fraction bits that represent the phase factors of the FFT.

Therefore, the overall timing shift register length is

$$L_s = 1 + N_i + N_f + N_s + \sum_{n=1}^{N_s} L(n) \tag{2}$$

where Ni and Nf are the number of integer and fraction bits

respectively that represent the input data words to the FFT,

Ns is the number of FFT stages, and L(n) is taken from (1).

The time it takes for the first load signal to propagate

through the timing shift register and appear as the ready signal

is Ls, however subsequent data sets only require Ls − Ns

clocks between load signals.

The number of clocks each FFT stage requires for its input

to start appearing on its output is

$$K(n) = \begin{cases} 2 & \text{stage } n \text{ is all trivial} \\ 4 & \text{stage } n \text{ has some non-trivial} \end{cases} \tag{3}$$

therefore each FFT stage n reset tap point is given by

$$T(n) = \begin{cases} -1 & \text{for } n = 1 \\ T(n-1) + K(n-1) & \text{for } n = 2 \ldots N_s \end{cases} \tag{4}$$

where the timing shift register is indexed as 0 . . . Ls − 1. It

should be noted that (3) includes one extra clock required for

each FFT stage to grow the word size of the data value by

one bit. Also, T (1) should be connected directly to the load

signal from the circuit outside the FFT component.
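Equations (1) through (4) can be evaluated with a short Python sketch. The function names are hypothetical, and the example values are illustrative assumptions, not figures from the paper: a four-stage FFT in which only the third stage has non-trivial phase factors, with Ni = 1, Nf = 8, and Tf = 8.

```python
def stage_latency(trivial, tf):
    """Equation (1): clocks for a stage result to appear on its output."""
    return 1 if trivial else tf + 2


def shift_register_length(ni, nf, stage_trivial, tf):
    """Equation (2): overall timing shift register length Ls."""
    ns = len(stage_trivial)
    return 1 + ni + nf + ns + sum(stage_latency(t, tf) for t in stage_trivial)


def stage_delay(trivial):
    """Equation (3): clocks before a stage input starts appearing on its output."""
    return 2 if trivial else 4


def reset_taps(stage_trivial):
    """Equation (4): reset tap T(n) for n = 1 ... Ns; T(1) = -1 is the load signal."""
    taps = [-1]
    for trivial in stage_trivial[:-1]:         # K(1) ... K(Ns - 1)
        taps.append(taps[-1] + stage_delay(trivial))
    return taps


stages = [True, True, False, True]             # assumed trivial/non-trivial pattern
ls = shift_register_length(ni=1, nf=8, stage_trivial=stages, tf=8)
print(ls, ls - len(stages), reset_taps(stages))
```

For these assumed values the sketch gives Ls = 27, a spacing of Ls − Ns = 23 clocks between subsequent load signals, and reset taps of −1, 1, 3, and 7, all of which lie inside the register index range 0 … Ls − 1 except the first, which is the external load signal.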

V. PERFORMANCE AND PORTABILITY

To evaluate the performance of the autogenerated VHDL

FFT component, various FFTs were generated and synthesized

using Xilinx ISE 13.2, as shown in Table I, with our largest

available Virtex-6 LXT -3 speed grade FPGA. One can clearly

see that the parallel arithmetic FFT F1 has a much higher

throughput of 23.04 Gs/s versus 2.369 Gs/s for the serial-

parallel FFT F2, however the hardware cost is ≈2.66 times

greater than FFT F2. Moreover, the Virtex-6 LXT using

SelectIO [9] can only input data at 1.4 Gs/s, therefore FFT

F1 is highly data starved. The throughput of serial-parallel

FFT F2 is closer to the maximum I/O data rate of the FPGA

part and is therefore a more appropriate solution.
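The comparison above is simple arithmetic on the Table I figures and the quoted 1.4 Gs/s input rate, reproduced here as a Python sketch. The function names and the "I/O utilization" measure (the fraction of an FFT's throughput that the input rate can actually feed) are illustrative assumptions, not metrics defined in the paper.

```python
IO_LIMIT_GSPS = 1.4                # SelectIO input rate quoted in the text

TABLE_I = {
    "F1": {"slices": 23862, "rate_gsps": 23.04},   # 256-point, parallel
    "F2": {"slices": 8963, "rate_gsps": 2.369},    # 256-point, serial-parallel
}


def slice_ratio(a, b):
    """Hardware cost of FFT a relative to FFT b, in occupied slices."""
    return TABLE_I[a]["slices"] / TABLE_I[b]["slices"]


def io_utilization(name):
    """Fraction of the FFT's throughput that the input rate can supply."""
    return min(1.0, IO_LIMIT_GSPS / TABLE_I[name]["rate_gsps"])


print(round(slice_ratio("F1", "F2"), 2))
print(round(io_utilization("F1"), 3), round(io_utilization("F2"), 3))
```

On these figures F1 costs about 2.66 times the slices of F2, yet the input rate can keep only about 6% of its throughput busy, against about 59% for F2.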

To analyze serial-parallel butterfly architecture throughput

versus FFT size, FFT F3 was compared to FFT F2 in Table I.

One can see that the latency for the data to flow through the

pipeline increases from 81 to 125 clocks, however the through-

put remains approximately the same at 2.426 Gs/s. This occurs

because the larger front-end of FFT F3 compensates for the

longer pipeline. Also, the hardware cost of serial-parallel FFT

F3 is approximately the same as parallel arithmetic FFT F1,

even though 512-point FFT F3 has four times the number of

data paths and butterflies.

TABLE I
VARIOUS FFTS WITH 11-BIT INPUT WORDS AND 14-BIT PHASE FACTORS

FFT  Point Size  Butterfly Radix  Math Type  Slices Used  Pipeline Clocks  Sample Rate
F1   256         4                parallel   23862        1                23.04 Gs/s
F2   256         4                ser-par    8963         81               2.369 Gs/s
F3   512         2                ser-par    24677        125              2.426 Gs/s

Fig. 4. Normalized comparison of various Q1.8 64-point radix-2 FFT implementations using 8-bit fixed length phase factors (288 DSP slices were used for both Parallel DSP and SPIRAL DSP). (Bar chart not reproduced; it plots Maximum Throughput and SLICEs Used on a normalized 0 to 1 axis for Parallel Logic (unscaled), Parallel Logic (scaled), Parallel DSP (scaled), SPIRAL DSP (scaled), Serial Logic, and CoreGen.)

Various 64-point radix-2 FFTs were also generated

and compared to SPIRAL and Xilinx CoreGen in terms

of maximum throughput and slices used. Variations of

scaled/unscaled, and Logic/DSP multipliers were generated.

As Fig. 4 shows, unscaled Logic multipliers use more slices

than scaled ones, although the throughput is the same,

indicating the number of adders required to implement the

FFT has a greater impact on hardware cost than the size of the

adders. By simply changing one line in the developed VHDL

library, DSP multipliers were generated instead of Logic

multipliers, resulting in a DSP based FFT that outperformed

SPIRAL DSP both in terms of higher throughput and lower

hardware cost.

To test the portability of the autogenerated VHDL, FFT

F1 from Table I was imported without modification into

Cadence Encounter 6.1.5 using the IBM 65 nm 6-metal digital

ASIC process. The ASIC cell layout completed successfully

as shown in Fig. 5. The FFT cell measures 1.8 mm per side,

contains 482,875 standard cells, has a clock fanout of 63,488

gates, and took approximately 5 hours to complete. The keys

to successful importation into Cadence were the consistency of

the generated VHDL code and the modularity of the developed

VHDL library package.

VI. CONCLUSION

This paper has described an FFT autogeneration tool that accepts a set of FFT parameters and generates a VHDL component to be synthesized for use in FPGAs. Unlike proprietary FFT IP cores, the VHDL code generated is completely viewable and modifiable with no reliance on any particular FPGA vendor. Both parallel arithmetic and serial-parallel butterfly architectures can be generated where serial-parallel allows larger sized FFTs to fit inside available FPGA parts. A 512-point serial-parallel FFT was successfully generated and easily fits available FPGA parts with regard to I/O throughput and slices used. A comparison of 64-point radix-2 FFTs was performed and has shown that the autogenerated parallel arithmetic DSP based FFT had higher throughput and lower hardware cost than the equivalent SPIRAL generated FFT. In addition, the generated VHDL code can be imported without modification into Cadence Encounter for rapid FFT cell creation.

Fig. 5. 256-point radix-4 FFT imported without modification to Cadence Encounter using IBM 65 nm 6-metal process (the cell size is 1.8 mm per side)

REFERENCES

[1] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Computation, Vol. 19, 1965, pp. 297–301.

[2] DSP Cores from IP Cores, Inc. (http://www.ipcores.com).

[3] FFT IP from Dillon Engineering, Inc. (http://www.dilloneng.com).

[4] C. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP Generator for FPGA Platforms," IEEE T. Circuits-I, Vol. 58, No. 4, Apr. 2011, pp. 755–764.

[5] DFT/FFT IP Core Generator from Carnegie Mellon University (http://www.spiral.net).

[6] J.G. Proakis and D.G. Manolakis, Digital Signal Processing, 4th ed., Upper Saddle River: Pearson Prentice Hall, 2007, pp. 449–461.

[7] L. Wenqi, W. Xuan, and S. Xiangran, "Design of Fixed-Point High-Performance FFT Processor," ICETC 2010, Vol. 5, 2010, pp. V5-139–V5-143.

[8] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE J. Solid-St. Circ., Vol. 39, No. 3, Mar. 2004, pp. 484–493.

[9] Virtex-6 LXT FPGAs from Xilinx, Inc. (http://www.xilinx.com).
