CAD Tool Autogeneration of VHDL FFT for
FPGA/ASIC Implementation
Todd E. Schmuland and Mohsin M. Jamali
Department of Electrical Engineering & Computer Science
The University of Toledo
Toledo, OH, 43606
[email protected], [email protected]
Matthew B. Longbrake and Peter E. Buxa
AFRL/RYDR
Wright-Patterson AFB
Dayton, OH, 45433
[email protected], [email protected]
Abstract—Hand-coding Fast Fourier Transforms (FFTs) in Hardware Description Language (HDL) is time consuming and prone to errors. Proprietary IP cores are available, however they are closed-source and unviewable. The open-source FFT generator SPIRAL is available, however it only produces parallel arithmetic solutions and thus limits the maximum FFT size that will fit in available Field Programmable Gate Arrays (FPGAs). An autogenerator of VHDL FFTs is described that takes a set of FFT parameters and generates an FFT component with feedback of occupied slices, maximum frequency, and dynamic range performance. Both parallel arithmetic and serial-parallel butterfly architectures can be generated where serial-parallel allows larger sized FFTs to fit inside available FPGA parts. Emphasis is placed on large sized serial-parallel FFTs and portability to Application-Specific Integrated Circuits (ASICs) using Cadence Encounter. Serial-parallel FFT pipeline control and FPGA hardware reduction are also investigated.
Index Terms—FFT; fixed-point; VHDL; FPGA; autogeneration
I. INTRODUCTION
System-on-Chip (SoC) solutions using Field Programmable
Gate Arrays (FPGAs) or Application-Specific Integrated Cir-
cuits (ASICs) are desirable to lower hardware costs and keep
circuit sizes small. The component building blocks used in
an SoC are typically hand-coded in Hardware Description
Language (HDL). This endeavor is both time consuming and
prone to implementation errors. A key component used in
many SoC solutions is the Fast Fourier Transform (FFT) [1].
However, the performance of the FFT component in terms of
occupied slices, maximum frequency, and dynamic range, is
not known until the HDL for the FFT component is coded,
synthesized, and measured. It would be desirable to have a
software tool autogenerate HDL for an FFT component where
an engineer simply provides the targeted characteristics of the
FFT. In addition, the software tool should give feedback to
the engineer on the performance of the autogenerated FFT
component. Using this feedback, the engineer can focus on
the overall SoC performance and make adjustments to the FFT
component as necessary.
This paper describes a software tool, written as a MATLAB
function, that allows an FFT to be specified via its
programmable/selectable parameters such as input word size, FFT
size, Decimation-in-Time (DIT) or Decimation-in-Frequency
(DIF), butterfly radix, phase factor (twiddle factor)
quantization, and stage scaling. The software tool provides a bit-true
MATLAB simulation of the FFT architecture and an option to
autogenerate Very-high-speed integrated circuit HDL (VHDL)
for vendor-independent FPGA or ASIC implementation. The
generated FFT structure is fully parallel where every butterfly
is instantiated, however each butterfly is optimized by
leveraging constant multipliers.
This work was sponsored by the Dayton Area Graduate Studies Institute (DAGSI) with support from the Air Force Research Lab, Sensors Directorate.
Unlike commercial FFT IP cores [2]–[4], which are deliv-
ered as a black box FFT implementation that conform to a set
of parameters, this software tool generates open source VHDL
of parameterized FFTs whose contents are fully exposed and
portable to other hardware platforms. The popular SPIRAL
FFT generator [5] only produces HDL with parallel arithmetic
using DSP blocks which limits the FFT size that can fit into
a given FPGA part. Our software tool however, gives the
option of autogenerating a serial-parallel constant multiplier
butterfly architecture, thus allowing larger FFT sizes to fit in
available FPGA parts. The targeted use of the resulting VHDL
entity is for SoC solutions where the bulk of DSP and RAM
blocks [6]–[8] are reserved for components other than the FFT
component.
The paper is organized as follows. In Section II, the FFT
parameters the VHDL autogeneration tool accepts as input
are described. Section III discusses the functions, macros, and
entities developed as a VHDL package for constructing FFTs.
The autogeneration of VHDL for parallel arithmetic and serial-
parallel butterfly architectures is covered in Section IV. Section
V looks at the performance characteristics of the autogenerated
VHDL code and the portability of the VHDL code. Section VI
concludes the paper.
II. SOFTWARE DEFINED FFT PARAMETERS
The VHDL autogenerator takes many FFT parameters as
input. This offers great flexibility in designing an FFT com-
ponent for an SoC without having to write a single line of
VHDL code. The FFT parameters entered into the VHDL
autogenerator are:
• Number of input samples
• DIT or DIF
• Butterfly radix of 2, 4, or split (auto-determined)
• Butterfly architecture: serial-parallel or parallel arithmetic
• Input word size
• Scaling between FFT stages (output word size)
• Fixed or variable length phase factors
• Complex multiplier built using 3 or 4 multipliers
978-1-4673-0859-5/12/$31.00 ©2012 IEEE
[Figure 1: flowgraph with inputs x[0] to x[15] in natural order, phase factors W0, W1, W2, W3, W4, W6, and W9 on the butterfly branches, and outputs in bit-reversed order X[0], X[8], X[4], X[12], X[2], X[10], X[6], X[14], X[1], X[9], X[5], X[13], X[3], X[11], X[7], X[15]]
Fig. 1. FFT flowgraph of a 16-point DIT radix-4 phase factor map with radix-4 butterflies constructed using four generalized 2-input butterflies as highlighted by the box
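The last parameter in the list, a complex multiplier built from 3 or 4 real multipliers, can be illustrated with a short sketch. The paper does not state which 3-multiplier formulation the tool uses, so the one below is an assumption: it is a commonly used rearrangement that trades one multiplier for three extra adders. The function names are hypothetical.

```python
def cmul4(a, b, c, d):
    """(a + jb)(c + jd) using 4 real multiplies and 2 adds."""
    return a * c - b * d, a * d + b * c


def cmul3(a, b, c, d):
    """Same product using 3 real multiplies and 5 adds.

    Assumed formulation, not necessarily the one the tool generates.
    For a constant phase factor c + jd, the terms (d - c) and (c + d)
    can be precomputed, leaving 3 adds at run time.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


# Both forms give the same (real, imaginary) pair.
print(cmul4(3, 4, 5, -2), cmul3(3, 4, 5, -2))
```

The 3-multiplier form is attractive when multipliers dominate the hardware cost, while the 4-multiplier form has a shorter adder chain.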
The software tool uses one master FFT flowgraph and
determines the correct phase factor map while leveraging
radix-4 butterflies whenever possible. For example, Fig. 1
shows a 16-point DIT with radix-4 butterflies constructed with
four generalized 2-input butterflies. Fully parallel structured
FFTs have constant phase factors per butterfly, therefore the
complex multipliers reduce to a series of adders shifted by
position. Both fixed and variable length phase factors are
offered as choices for autogenerated VHDL.
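As a rough illustration of how a constant phase factor turns a multiplier into "adders shifted by position", the sketch below decomposes a quantized constant into its set bits and sums shifted copies of the input. This is a minimal sketch assuming plain binary decomposition of a non-negative constant; the tool may use a different recoding, and the helper names and the 8 fraction bit quantization are hypothetical.

```python
import math


def shift_add_terms(constant):
    """Bit positions of the 1s in a non-negative integer constant."""
    return [pos for pos in range(constant.bit_length()) if (constant >> pos) & 1]


def constant_multiply(x, constant):
    """Multiply x by a non-negative constant using only shifts and adds."""
    return sum(x << pos for pos in shift_add_terms(constant))


# Example: cos(pi/8) quantized to 8 fraction bits (assumed word length).
coeff = round(math.cos(math.pi / 8) * 2**8)
print(coeff, shift_add_terms(coeff))
print(constant_multiply(100, coeff))
```

Each set bit of the constant contributes one shifted copy of the input, so a constant with k set bits needs k - 1 adders in this simple scheme.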
Butterfly architecture can either be serial-parallel, where
data is operated on one bit at a time, or it can be processed
in parallel as bit vectors. Serial-parallel butterfly architecture
lends itself well to partitioning very large FFTs across several
FPGA parts, since each data path is only one signal/pin.
In addition, the ratio of serial-parallel throughput versus
occupied slices is favorable since every butterfly of the FFT
is instantiated and clocked simultaneously.
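The idea of operating on data one bit at a time can be sketched with a bit-serial adder model. This is a simplified illustration, assuming unsigned words presented least significant bit first; the paper does not give these details, and the function names are hypothetical.

```python
def to_bits(value, width):
    """Unsigned integer to an LSB-first list of bits."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """LSB-first list of bits back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, one bit per clock.

    A single carry value stands in for the carry flip-flop of a
    hardware bit-serial adder.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)
        carry = total >> 1
    out.append(carry)  # one extra clock for the grown word size
    return out


print(from_bits(serial_add(to_bits(13, 4), to_bits(11, 4))))
```

Only one wire per operand is needed, which is why a serial data path costs a single signal or pin regardless of word size.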
As data flows through an FFT, the word size grows by one
bit per stage due to the add/subtract operations’ carry bit inside
each butterfly. Depending on the application, either the output
can be truncated after the FFT, or the data can be truncated
internal to the FFT, also known as scaling, by truncating one
bit between each FFT stage.
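The word growth described above can be worked through numerically. The sketch below assumes one stage of 2-input butterflies per factor of two in FFT size, so a 256-point FFT has 8 stages; the function names are hypothetical.

```python
def output_word_size(input_bits, num_stages, scaled):
    """Word size at the FFT output, in bits."""
    # Unscaled: each stage adds one carry bit.
    # Scaled: one bit is truncated between stages, so the size is constant.
    return input_bits if scaled else input_bits + num_stages


def butterfly_sum(a, b, scaled):
    """Add two integer samples, dropping the LSB when scaling is enabled."""
    total = a + b
    # Python's >> floors, which matches two's complement truncation.
    return total >> 1 if scaled else total


# 11-bit input words through 8 stages (256-point FFT).
print(output_word_size(11, 8, scaled=False))
print(output_word_size(11, 8, scaled=True))
print(butterfly_sum(1023, 1023, scaled=True))
```

Scaling keeps the adders the same width in every stage at the cost of one bit of precision per stage.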
III. DEVELOPED VHDL LIBRARY PACKAGE
The VHDL autogenerator uses our developed VHDL library
package consisting of functions, macros, and entities
to construct the VHDL files for a specified FFT as shown
in Fig. 2. Functions were used for the parallel arithmetic
since they allow the recursive function calling necessary to
construct the constant multipliers which consist of adders
shifted by position. Functions also facilitate easier construction
of pipelining where each FFT stage is clocked using VHDL
processes containing arithmetic operators for each butterfly in
a behavioral fashion.
[Figure 2: flow diagram. User supplied input parameters (FFT size 2^n, DIT/DIF, scaling); generate phase factor constants; generate FFT flowgraph (butterfly data paths, phase factor locations); analyze trivial/non-trivial multiplications and determine phase factor case; butterfly arithmetic (parallel/serial operators, phase factor multipliers, butterfly equation) drawing on the developed VHDL library of functions, macros, and entities (ADD, SUB, MUL); stage progression of data with adaptive scaling between stages; timing shift register taps (serial); bit-reversal output signal map; mux/demux entity wrapper]
Fig. 2. Autogeneration software tool showing FFT parameter input, flowgraph generation, trivial/non-trivial phase factor determination for each FFT stage, generation of each FFT butterfly, stage scaling, and bit-reversal output with mux/demux wrapper
Entities were used for the serial-parallel butterfly archi-
tecture since they require a structural model where each
arithmetic operator clocks and processes one bit of data at
a time. The use of entities also allows generics to be used to
provide the constant bit-string (phase factor) to the constant
multiplier entity. A second generic is used to dictate whether
a buffering delay line should be placed on the entity’s output
and how many clocks the output should be delayed by. This
is necessary to keep all the butterflies of a given FFT stage
in sync such that trivial and non-trivial phase factor butterfly
results appear at the next stage with the same latency. A trivial
butterfly is one where the phase factor is either 1+0j or 0−1j,
thus reducing the complex multiply to a simple identity or
complex conjugate with swap operation respectively.
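The two trivial phase factors can be checked with a few lines of integer arithmetic. This is a minimal sketch assuming the usual definition W_N^k = exp(-j2πk/N) with N divisible by 4, so that k = 0 gives 1+0j and k = N/4 gives 0−1j; the function names are hypothetical.

```python
def is_trivial(k, n):
    """True when W_n^k equals 1+0j or 0-1j (n assumed divisible by 4)."""
    return k % n in (0, n // 4)


def multiply_by_one(re, im):
    """Identity: the data passes through unchanged."""
    return re, im


def multiply_by_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: swap the parts, negate the new imaginary part."""
    return im, -re


print(is_trivial(4, 16), is_trivial(2, 16))
print(multiply_by_minus_j(3, 5))
```

Neither case needs a multiplier, which is why trivial butterflies finish earlier and need the buffering delay line to stay aligned with non-trivial ones.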
IV. AUTOGENERATION OF FFT BUTTERFLIES
[Figure 3: butterfly generation flow for a generalized 2-input butterfly with phase factors WU and WL, intermediate values U1 and L1, and outputs U2 and L2. Pass 1 computes needed variables and bit vector sizes. Pass 2 selects one of four cases:
Case  WU         WL         U1         L1         Butterfly output
1     1 + 0j     1 + 0j     Pass-thru  Pass-thru  U2: add/add, L2: subtract/subtract
2     1 + 0j     Complex *  Pass-thru  Multiply   round(L1), U2: add/add, L2: subtract/subtract
3     Complex *  Complex *  Multiply   Multiply   U2: round(add/add), L2: round(subtract/subtract)
4     1 + 0j     0 - 1j     Pass-thru  Pass-thru  U2: add/subtract, L2: subtract/add]
Fig. 3. Autogeneration of FFT butterflies using a two pass approach where pass one determines the variables required for the butterfly and pass two generates the arithmetic operator statements based on trivial/non-trivial phase factor cases
After the VHDL autogeneration tool has determined the
FFT flowgraph and phase factor locations, the tool creates
a set of VHDL files that will synthesize into a usable FFT
component. The files include the top level entity for the FFT
component and individual files for each FFT stage. The top
level entity consists of concurrent statements for each FFT
stage, plus a bit-reversed signal assignment from the last FFT
stage to the FFT output, thus creating a component with
in-order input and in-order output signals.
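The bit-reversed signal assignment can be illustrated by computing the index map directly. This is a minimal sketch, assuming a power-of-two FFT size; the function names are hypothetical and the tool itself emits VHDL signal assignments rather than running code like this.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(fft_size):
    """Output index found at each position of the last FFT stage."""
    num_bits = fft_size.bit_length() - 1  # log2 for a power-of-two size
    return [bit_reverse(i, num_bits) for i in range(fft_size)]


print(bit_reversed_order(16))
```

For a 16-point FFT this yields the order X[0], X[8], X[4], X[12], and so on, which is the output ordering on the right-hand side of the Fig. 1 flowgraph. Because bit reversal is its own inverse, the same map routes the last stage back to in-order outputs.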
The tool autogenerates each generalized 2-input butterfly
using a two pass approach as shown in Fig. 3. Pass one
computes the needed variables and bit vector sizes while pass
two generates the function or entity arithmetic operators for the
butterfly. Every butterfly in an FFT falls into one of four cases
depending on the upper and lower phase factors and whether
they are a trivial or non-trivial complex multiplication. The
resulting case determines if the input data is passed through
as is or requires a complex multiplication, and determines
how the phase factor multiplications are combined into the
butterfly’s output.
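The four-case selection can be sketched as a small classifier followed by the matching arithmetic. The case numbering follows the phase factor pairs listed in Fig. 3; everything else is an assumption made for illustration: the sketch uses floating-point complex numbers, omits the rounding steps of cases 2 and 3, and sends any pair not matching cases 1, 2, or 4 to the general case 3. The names are hypothetical.

```python
ONE = complex(1, 0)
MINUS_J = complex(0, -1)


def butterfly_case(w_upper, w_lower):
    """Classify a butterfly by its upper and lower phase factors."""
    if w_upper == ONE and w_lower == ONE:
        return 1
    if w_upper == ONE and w_lower == MINUS_J:
        return 4
    if w_upper == ONE:
        return 2
    return 3  # general case: both inputs are multiplied


def butterfly(upper, lower, w_upper, w_lower):
    """Return (U2, L2) for one generalized 2-input butterfly."""
    case = butterfly_case(w_upper, w_lower)
    if case == 1:
        u1, l1 = upper, lower
    elif case == 2:
        u1, l1 = upper, lower * w_lower
    elif case == 3:
        u1, l1 = upper * w_upper, lower * w_lower
    else:
        # Multiply by -j without a multiplier: swap and negate.
        u1, l1 = upper, complex(lower.imag, -lower.real)
    return u1 + l1, u1 - l1


print(butterfly_case(1, -1j))
print(butterfly(1 + 2j, 3 + 4j, 1, -1j))
```

In case 4 the swap turns the real/imaginary add pair into add/subtract for U2 and subtract/add for L2, which matches the operator pattern listed for that case in Fig. 3.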
An FFT component using serial-parallel butterfly architec-
ture requires pipeline timing circuitry to control:
1) Timing the data as it passes through the FFT
2) Indicating when each FFT result is ready to be latched
3) Resetting each FFT stage at the correct time
The following equations are used by the software tool
to determine the timing shift register length, tap points for
FFT stage resets, and minimum number of pipeline clocks
required for continuous operation of the FFT component. Each
FFT stage requires a specific number of clocks for its fixed-
point result to appear on its output, with the same fractional
precision, such that
\[
L(n) =
\begin{cases}
1 & \text{stage } n \text{ is all trivial} \\
T_f + 2 & \text{stage } n \text{ has some non-trivial}
\end{cases}
\tag{1}
\]
where L(n) is the number of clocks required for FFT stage n
to have its result appear on its output and Tf is the number
of fraction bits that represent the phase factors of the FFT.
Therefore, the overall timing shift register length is
\[
L_s = 1 + N_i + N_f + N_s + \sum_{n=1}^{N_s} L(n)
\tag{2}
\]
where Ni and Nf are the number of integer and fraction bits
respectively that represent the input data words to the FFT,
Ns is the number of FFT stages, and L(n) is taken from (1).
The time it takes for the first load signal to propagate
through the timing shift register and appear as the ready signal
is Ls, however subsequent data sets only require Ls − Ns
clocks between load signals.
The number of clocks each FFT stage requires for its input
to start appearing on its output is
\[
K(n) =
\begin{cases}
2 & \text{stage } n \text{ is all trivial} \\
4 & \text{stage } n \text{ has some non-trivial}
\end{cases}
\tag{3}
\]
therefore each FFT stage n reset tap point is given by
\[
T(n) =
\begin{cases}
-1 & \text{for } n = 1 \\
T(n-1) + K(n-1) & \text{for } n = 2 \ldots N_s
\end{cases}
\tag{4}
\]
where the timing shift register is indexed as 0 . . . Ls − 1. It
should be noted that (3) includes one extra clock required for
each FFT stage to grow the word size of the data value by
one bit. Also, T (1) should be connected directly to the load
signal from the circuit outside the FFT component.
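Equations (1) through (4) can be evaluated with a short script. The sketch below is an illustration only: the function names are hypothetical, and the example values (a 4-stage FFT whose third stage is non-trivial, 1 integer bit, 8 fraction bits, and 7 phase factor fraction bits) are assumed for the arithmetic rather than taken from a generated design.

```python
def stage_latency(trivial, tf):
    """L(n) from (1): clocks for a stage result to appear on its output."""
    return 1 if trivial else tf + 2


def shift_register_length(ni, nf, tf, stage_is_trivial):
    """Ls from (2), with one boolean per FFT stage."""
    ns = len(stage_is_trivial)
    return 1 + ni + nf + ns + sum(stage_latency(t, tf) for t in stage_is_trivial)


def stage_start_clocks(trivial):
    """K(n) from (3): clocks before a stage input starts appearing on its output."""
    return 2 if trivial else 4


def reset_taps(stage_is_trivial):
    """T(n) from (4); index -1 means the external load signal."""
    taps = [-1]
    for trivial in stage_is_trivial[:-1]:
        taps.append(taps[-1] + stage_start_clocks(trivial))
    return taps


stages = [True, True, False, True]  # assumed trivial/non-trivial pattern
ls = shift_register_length(ni=1, nf=8, tf=7, stage_is_trivial=stages)
print(ls)                 # timing shift register length Ls
print(ls - len(stages))   # clocks between load signals after the first data set
print(reset_taps(stages)) # reset tap point for each stage
```

For these assumed values the register is 26 entries long, a new data set can be loaded every 22 clocks, and the stage resets tap the register at indices -1, 1, 3, and 7.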
V. PERFORMANCE AND PORTABILITY
To evaluate the performance of the autogenerated VHDL
FFT component, various FFTs were generated and synthesized
using Xilinx ISE 13.2, as shown in Table I, with our largest
available Virtex-6 LXT -3 speed grade FPGA. One can clearly
see that the parallel arithmetic FFT F1 has a much higher
throughput of 23.04 Gs/s versus 2.369 Gs/s for the serial-
parallel FFT F2, however the hardware cost is ≈2.66 times
greater than FFT F2. Moreover, the Virtex-6 LXT using
SelectIO [9] can only input data at 1.4 Gs/s, therefore FFT
F1 is highly data starved. The throughput of serial-parallel
FFT F2 is closer to the maximum I/O data rate of the FPGA
part and is therefore a more appropriate solution.
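The comparison in this paragraph is simple arithmetic on the Table I figures and the 1.4 Gs/s I/O limit quoted above. The sketch below reproduces it; the dictionary layout and function names are hypothetical.

```python
# Slices and sample rates as reported in Table I.
TABLE_I = {
    "F1": {"slices": 23862, "rate_gsps": 23.04},
    "F2": {"slices": 8963, "rate_gsps": 2.369},
}
IO_LIMIT_GSPS = 1.4  # Virtex-6 LXT SelectIO input rate quoted in the text


def slice_ratio(a, b):
    """Hardware cost of FFT a relative to FFT b."""
    return TABLE_I[a]["slices"] / TABLE_I[b]["slices"]


def io_oversupply(name):
    """How many times the FFT throughput exceeds the I/O input rate."""
    return TABLE_I[name]["rate_gsps"] / IO_LIMIT_GSPS


print(round(slice_ratio("F1", "F2"), 2))
print(round(io_oversupply("F1"), 1), round(io_oversupply("F2"), 1))
```

F1 could consume data roughly 16.5 times faster than the part can supply it, whereas F2 exceeds the I/O rate by a factor of about 1.7, which is the sense in which F2 is the better match.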
To analyze serial-parallel butterfly architecture throughput
versus FFT size, FFT F3 was compared to FFT F2 in Table I.
One can see that the latency for the data to flow through the
pipeline increases from 81 to 125 clocks, however the through-
put remains approximately the same at 2.426 Gs/s. This occurs
because the larger front-end of FFT F3 compensates for the
longer pipeline. Also, the hardware cost of serial-parallel FFT
F3 is approximately the same as parallel arithmetic FFT F1,
even though 512-point FFT F3 has four times the number of
data paths and butterflies.
TABLE I
VARIOUS FFTS WITH 11-BIT INPUT WORDS AND 14-BIT PHASE FACTORS
FFT  Point Size  Butterfly Radix  Math Type  Slices Used  Pipeline Clocks  Sample Rate
F1   256         4                parallel   23862        1                23.04 Gs/s
F2   256         4                ser-par    8963         81               2.369 Gs/s
F3   512         2                ser-par    24677        125              2.426 Gs/s
[Figure 4: chart of Maximum Throughput and SLICEs Used, normalized from 0 to 1, for Parallel Logic (unscaled), Parallel Logic (scaled), Parallel DSP (scaled), SPIRAL DSP (scaled), Serial Logic, and CoreGen]
Fig. 4. Normalized comparison of various Q1.8 64-point radix-2 FFT implementations using 8-bit fixed length phase factors (288 DSP slices were used for both Parallel DSP and SPIRAL DSP)
Various 64-point radix-2 FFTs were also generated
and compared to SPIRAL and Xilinx CoreGen in terms
of maximum throughput and slices used. Variations of
scaled/unscaled, and Logic/DSP multipliers were generated.
As Fig. 4 shows, unscaled Logic multipliers use more slices
than scaled ones, however the throughput is the same,
indicating the number of adders required to implement the
FFT has a greater impact on hardware cost than the size of the
adders. By simply changing one line in the developed VHDL
library, DSP multipliers were generated instead of Logic
multipliers, resulting in a DSP based FFT that outperformed
SPIRAL DSP both in terms of higher throughput and lower
hardware cost.
To test the portability of the autogenerated VHDL, FFT
F1 from Table I was imported without modification into
Cadence Encounter 6.1.5 using the IBM 65 nm 6-metal digital
ASIC process. The ASIC cell layout completed successfully
as shown in Fig. 5. The FFT cell measures 1.8 mm per side,
contains 482,875 standard cells, has a clock fanout of 63,488
gates, and took approximately 5 hours to complete. The key
for successful importation into Cadence was generated VHDL
code consistency and modularity of the developed VHDL
library package.
Fig. 5. 256-point radix-4 FFT imported without modification to Cadence Encounter using IBM 65 nm 6-metal process (the cell size is 1.8 mm per side)
VI. CONCLUSION
This paper has described an FFT autogeneration tool that
accepts a set of FFT parameters and generates a VHDL
component to be synthesized for use in FPGAs. Unlike proprietary
FFT IP cores, the VHDL code generated is completely
viewable and modifiable with no reliance on any particular
FPGA vendor. Both parallel arithmetic and serial-parallel
butterfly architectures can be generated where serial-parallel
allows larger sized FFTs to fit inside available FPGA parts.
A 512-point serial-parallel FFT was successfully generated
and easily fits available FPGA parts with regard to I/O
throughput and slices used. A comparison of 64-point radix-2
FFTs was performed and has shown that the autogenerated
parallel arithmetic DSP-based FFT had higher throughput and
lower hardware cost than the equivalent SPIRAL-generated
FFT. In addition, the generated VHDL code can be imported
without modification into Cadence Encounter for rapid FFT
cell creation.
REFERENCES
[1] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Computation, Vol. 19, 1965, pp. 297–301.
[2] DSP Cores from IP Cores, Inc. (http://www.ipcores.com).
[3] FFT IP from Dillon Engineering, Inc. (http://www.dilloneng.com).
[4] C. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP Generator for FPGA Platforms," IEEE T. Circuits-I, Vol. 58, No. 4, Apr. 2011, pp. 755–764.
[5] DFT/FFT IP Core Generator from Carnegie Mellon University (http://www.spiral.net).
[6] J.G. Proakis and D.G. Manolakis, Digital Signal Processing, 4th ed., Upper Saddle River: Pearson Prentice Hall, 2007, pp. 449–461.
[7] L. Wenqi, W. Xuan, and S. Xiangran, "Design of Fixed-Point High-Performance FFT Processor," ICETC 2010, Vol. 5, 2010, pp. V5-139–V5-143.
[8] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE J. Solid-St. Circ., Vol. 39, No. 3, Mar. 2004, pp. 484–493.
[9] Virtex-6 LXT FPGAs from Xilinx, Inc. (http://www.xilinx.com).