Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
UNLV Retrospective Theses & Dissertations
1-1-2005
Design and implementation of a fast Fourier transform Design and implementation of a fast Fourier transform
architecture using twiddle factor based decomposition algorithm architecture using twiddle factor based decomposition algorithm
Bhaarath Kumar University of Nevada, Las Vegas
Follow this and additional works at: https://digitalscholarship.unlv.edu/rtds
Repository Citation Repository Citation Kumar, Bhaarath, "Design and implementation of a fast Fourier transform architecture using twiddle factor based decomposition algorithm" (2005). UNLV Retrospective Theses & Dissertations. 1837. http://dx.doi.org/10.25669/uqf7-sxec
This Thesis is protected by copyright and/or related rights. It has been brought to you by Digital Scholarship@UNLV with permission from the rights-holder(s). You are free to use this Thesis in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Thesis has been accepted for inclusion in UNLV Retrospective Theses & Dissertations by an authorized administrator of Digital Scholarship@UNLV. For more information, please contact [email protected].
DESIGN AN D IM P LEM EN TATIO N OF A FAST FOURIER TRANSFORM
ARCHITECTURE USING T W ID D LE FACTOR BASED
DECOMPOSITION ALG O R ITH M
by
Bhaarath Kumar
Bachelor o f Engineering University o f Madras, India
2002
A thesis submitted in partial fu lfillm ent o f the requirement for the
Master of Science Degree in Electrical Engineering Department of Electrical and Computer Engineering
Howard R. Hughes College of Engineering
Graduate College University of Nevada, Las Vegas
August 2005
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 1429711
INFORMATION TO USERS
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.
UMIUMI Microform 1429711
Copyright 2006 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 300 North Zeeb Road
P.O. Box 1346 Ann Arbor, Ml 48106-1346
Reproduced witfi permission of tfie copyrigfit owner. Furtfier reproduction profiibited witfiout permission.
m m i.Thesis ApprovalThe Graduate College University of Nevada, Las Vegas
July 12 .2005
The Thesis prepared by
Bhaarath Kumar
Entitled
"Design and Implementation of a Fast Fourier Transform Architecture
_________ Using Twiddle Factor Based Decomposition Algorithm"__________
is approved in partial fulfillment of the requirements for the degree of
Exam
-V.Examination CommiMee Member
G radiM e College Faculty Representative
Master of Science in Electrical Engineering
Examitmtion Cmnmittee Chair
Dean o f the Graduate College
11
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
ABSTRACT
Design and Implementation of a Fast Fourier Transform Architecture using Twiddle Factor Based decomposition Algorithm
by
Bhaarath Kumar
Dr. Yingtao Jiang, Examination Committee Chair Assistant Professor
Department o f Electrical & Computer Engineering University o f Nevada, Las Vegas
W ith the advent o f signal processing and wireless communication mobile platform
devices, the necessity for data transformation from one form to another becomes an
unavoidable aspect. One such mathematical tool that is w idely used for transforming time
and frequency domain signals is Eourier Transform. Fast Fourier Transform (EFT) is
perhaps the fastest way to achieve transformation. Many algorithms and architectures
have been designed over the years in an attempt to make EFT algorithms more efficient
and to target many applications.
The main objective o f our work is to design, simulate and implement an architecture
based on the Twiddle-Eactor-B ased decomposition EFT algorithm. The significant
feature o f the algorithm is its effective memory access reduction that accounts to be as
much as 30% lesser than in any other conventional EFT algorithms. As a result o f this
memory reduction, this algorithm is said to be more power efficient and is said to
compute in much lesser number o f clock cycles than other algorithms developed.
Ill
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
LIST OF FIGURES
Figure L I Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components andusing N twiddle factors........................................................................................9
Figure 1.2 Data Flow graph o f an 8-point DFT calculated by splitting the N pointinput into two N/2 parts containing even and odd components andusing only N/2 tw iddle factors..........................................................................10
Figure 1.3 Data Flow graph o f an 8-point FFT, showing the entire decompositionuntil 2-point DFT stage is reached. The stage w ith is the laststage o f decomposition w ith all butterflies being 2-point DFTs.................. 11
Figure 2.1 A radix-2 Decimation-In-Time (D IT) butterfly structure.............................. 18Figure 2.2 Flow graph o f an 8-point D IT FFT structure using butterflies......................18Figure 2.3 A radix-2 Decimation-In-Frequency (D IF) butterfly structure.................... 20Figure 2.4 Flow graph o f an 8-point D IF FFT structure using butterflies......................20Figure 2.5 Flow graph o f a 16-point FFT structure based on Twiddle-Factor-
Based algorithm ................................................................................................. 23Figure 3.1 Flow graph o f a 16-point FFT structure based on Twiddle-Factor-
Based a lgorithm ................................................................................................. 27Figure 3.2 Pseudo code for the (log2N - l) stages decomposition..................................... 34Figure 3.3 B lock design representation o f (logzN -l) stages based on Twiddle
Factor decomposition a lgorithm ......................................................................36Figure 3.4 (log2N)‘ stage decomposition structure for A = 16........................................ 39Figure 3.5 Pseudo code fo r the (log2N - l) stages decomposition.....................................40Figure 3.6 V H D L Block implementation o f (log2N f^ stage based on Twiddle
factor decomposition algorithm .......................................................................42Figure 3.7 B lock implementation o f (log2N - l) stages and (log2N) stage based
on Twiddle factor decomposition algorithm.................................................. 43Figure 3.8 B lock level representation o f I-w ord R A M cell fo r storing 52-bit data... 45Figure 3.9 B lock representation o f a R A M decoder..........................................................46Figure 3.10a Basic group formation o f R A M cells using decoders.................................... 47Figure 3.10b Hierarchical group formation o f R A M cells using decoders........................ 47Figure 3.11 B lock representation o f Read Only Memory (ROM) and its controller.... 50Figure 3.12 Complex multiplication implementation using method I .............................53Figure 3.13 Complex m ultiplication implementation using method 2..............................54Figure 3.14 Real m ultiplication implementation stages......................................................55Figure 3.15 Complex multiplication implementation using method 2 .............................. 55Figure 3.16 Block level Data-path fo r group I fo r twiddle factor based architecture... 57Figure 3.17 B lock level Data-path fo r group2 for twiddle factor based architecture... 58Figure 3.18 B lock level Data-path fo r group3 for twiddle factor based architecture....59
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
The real focus o f the design is to build architecture to map this efficient algorithm on
to hardware retaining the maximum efficiency o f the algorithm. The complete design,
simulation and testing is done using A ctive -H D L tool which is a V H D L package
designed. The architecture designed is found to retain the memory savings capability o f
the algorithm thus enabling power efficiency.
IV
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Figure 4.1 Graphical comparison fo r ROM access between twiddle factor basedarchitecture and conventional D IT or D IF based architectures.................69
Figure 4.2 Graphical comparison fo r R A M access between twiddle factor basedarchitecture and conventional D IT or D IF based architectures.................71
Figure 4.3 Graphical comparison fo r m u ltip lie r operation usage between twiddleFactor based architecture and conventional D IT or D IF basedarchitectures......................................................................................................... 72
Figure A - I I V H D L block representation o f (log 2N - l) stages R A M Addressblock based on Twiddle factor decomposition algorithm generating10-bit R A M address sequences qa and qb.......................................................80
Figure A -I2 V H D L block representation o f (log 2N )‘ stage R A M AddressGeneration block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa_logn2and qb_logn and 6-bit Temporary register bank address qaddr....................81
Figure A-I3 V H D L design o f a 7-word R A M cell fo r storing imaginary parto f data.................................................................................................................... 82
Figure A -I4 V H D L design o f a 7-word R A M cell fo r storing real part o f data............... 82Figure A-I5 V H D L design o f a 2x4 Decoder block choosing one among four
outputs based on a 2-bit input. The input port is given as de_inp_2x4_real(l :0) and output ports are represented bye_out_2x4_real( 3 :0)............................................................................................ 83
Figure A - I6 V H D L design o f a 64-words RO M and a decoder along w ithinterconnects outputting 52-bits real and imaginary twiddle factors..............83
Figure A -I7 V H D L design o f a 64-words R A M using four 76-words R A M cellsAnd a decoder along w ith interconnects outputting 52-bits realand imaginary data................................................................................................84
Figure A - I8 V H D L design block showing the data-path s for group I, group2 andgroup3. The output signals for each group are represented by b_real_put_groupX and b_imaginary_out_groupX w ith Xrepresenting individual group numbers............................................................ 85
Figure A -I9 V H D L design block showing the resolution function designedfor the data-path to choose between the three data-path groups and also for determining the Read and Write operations o f the R A Mblock and other resolution blocks...................................................................... 86
Figure A-110 V H D L design block showing R A M Address generation block w ith resolution functions that aid in enabling and disablingvarious signals at different instances o f tim e................................................... 87
Figure A -H I V H D L design block showing R A M Address generation block, the R A M blocks, the ROM memory controller block, the RO M blocksand all the other interconnects............................................................................88
Figure A -H 2 V H D L design block partially showing interconnection betweenall the blocks o f the Twiddle factor based FFT architecture..........................89
VI
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
ACKNOW LEDGEMENTS
I take this opportunity to thank my advisor Dr. Yingtao Jiang, for his guidance and
support.
I am grateful to professors in my thesis committee for all their valuable suggestions. I
would like to thank M r. Stan Hanel fo r providing access to A ctive -H D L whenever I
needed it most.
I wish to thank all my friends and fam ily members.
vn
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
TABLE OE CONTENTS
ABSTR AC T........................................................ i i i
L IST OF FIG URES...................................................................................................................... v
AC KN O W LEDG EM ENTS.......................................................................................................v ii
CHAPTER 1 IN TR O D U C TIO N .......................................................................................... II I Thesis Objective................................................................................................................ 11.2 Fourier Analysis................................................................................................................2
1.2.1 The Fourier Series....................................... 21.2.2 Fourier Transform.................................................................................................. 41.2.3 Discrete Fourier Transform................................................................................... 51.2.4 Fast Fourier Transform ..........................................................................................7
1.2.4.1 Mathematical Calculation for FFT Computational Com plexity 81.2.5 Power Aware D esign........................................................................................... 12
1.3 Organization o f the Thesis W rite-up............................................................................13
CHAPTER 2 FAST FOURIER ALG O RITHM S AN D ARCH ITECTU RE.................152.1 Introduction..................................................................................................................... 152.2 Decimation-In-Time (D IT) FFT A lgorithm ................................................................172.3 Decimation-In-Frequency (DIF) FFT A lg o rith m ......................................................182.4 Decimation Based on Tw iddle Factors.......................................................................20
2.4.1 Decimation Procedure .......................................................................22
CHAPTER 3 ARCH ITECTU RAL DESIGN A N D IM P LE M E N TA TIO N .................253.1 Overview..........................................................................................................................253.2 Design Target.................................................................................................................. 263.3 A lgorithm Setup..............................................................................................................263.4 Random Access Memory (R AM ) Address Generation Block Design...................28
3.4.1 Formulae o f Computational Complexity for the Loop Structures for (log2N - l ) Stages o f Decomposition...................................................................30
3.4.2 Pseudo Code Implementation.............................................................................333.4.3 Architectural B lock Design and Implementation............................................ 34
3.5 Formulae o f Computational Complexity fo r the Loop Structures for(log2N)‘ Stage o f Decomposition................................................................................373.5.1 Memory Reduction Technique.........................................................................38
vin
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.5.2 Pseudo Code Implementation............................................................................. 403.5.3 Architectural B lock Design and Im plem entation............................................ 40
3.6 Random Access Memory Design................................................................................. 443.6.1 Read Only Memory (ROM) design................................................................... 48
3.7 Data-Path Design............................................................................................................503.7.1 Complex M u ltip lie r Implementation................................................................. 523.7.2 Group I Design and Implementation.................... 563.7.3 Group2 Design and Implementation.................................................................. 563.7.4 Group3 Design and Implementation.................................................................. 58
3.8 Summary.......................................................................................................................... 59
CHAPTER 4 AR C H ITEC TU R AL S IM U LA TIO N RESULTS AN D DISCUSSION 614.1 Output Simulation Results............................................................................................ 614.2 R A M Address Generation Block Parameters............................................................ 614.3 ROM Address Generation Block Parameters............................................................ 634.4 Random Access Memory Parameters..........................................................................634.5 Read Only Memory B lock Parameters....................................................................... 644.6 Data-Path B lock Parameters.........................................................................................644.7 Clock Generation and Frequency Calculations.......................................................... 654.8 Memory Access Feature............................................................................................... 674.9 M u ltip lie r Operational Savings Using Tw iddle Factor Based Architecture 714.10 Design Issues and Challenges.................................................................................... 734.11 Architectural Salient Features and Drawbacks........................................................76
CHAPTER 5 CONCLUSIONS AN D SUGGESTIONS FOR FUTURE W O R K 775.1 Summary o f W ork.......................................................................................................... 775.2 Suggestions fo r the Future............................................................................................77
APPENDIX V H D L Design o f Various Architectural B locks......................................80B IB L IO G R A P H Y....................................................................................................................... 90V IT A ............................................................................................................................................. 93
IX
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 1
INTRO DUCTIO N
l . I Thesis Objective
W ith increasing demand fo r mobile computing devices, conversion o f data between the
time and frequency domain has become vital. The Fast Fourier Transform (FFT) is one o f
the w idely used digital signal processing algorithms for this purpose. Numerous Fast
Fourier Transform algorithms have been developed over the years. Architectures, which
help realize these algorithms, have found applications in diverse areas as:
communications, signal processing, instrumentation, biomedical engineering. Sonics and
acoustics to name a few. The goal o f our work is the architectural design and
implementation o f a Fast Fourier Transform (FFT) processor, mapping an algorithm
whose decomposition is uniquely based on Tw iddle factors unlike the conventional
Decimation-In-Time (D IT ) or Decimation-In-Frequency (DIF). W ith numerous
architectures already in existence, one main aspect that differentiates this work from
other architectures is the algorithm that is used for the mapping purpose. The Twiddle
Factor based decomposition algorithm ensures a reduction in memory access by as much
as 30% [ I ] [10]. Hence, an architecture u tiliz ing this algorithm is expected to have lower
power consumption than any other conventional algorithm based architecture. The
memory reduction comes in the form o f twiddle factor access from the Read Only
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Memory (ROM), where it is generally stored and retrieved. The architecture is designed,
simulated and tested in Very large-scale integrated circuits Hardware Description
Language (V H D L) environment.
1.2 The Fourier Analysis
1.2.1 The Fourier Series
Consider a sequence x(n) that is periodic w ith a period N so that x{n) = x(n + kN) for
any integer value o f k. Such a sequence cannot be represented by its z-transform, since
there is no value o f z for which the z-transform w ill converge. In such situations, this
sequence can be expressed using the Fourier Series (FS) tool [4] [13] [14] [15] [16]. Any
periodic signal can be expressed as a sum o f sinusoidal and co sinusoidal oscillations.
This decomposition is termed as Fourier Series (FS). By reversing this procedure, a
periodic signal can be generated by superimposing sinusoidal and co-sinusoidal waves.
Fourier series make use o f the orthogonal relationships o f the sine and cosine functions.
The Fourier series is extremely useful in breaking down an arbitrary periodic function
into a set o f simple terms that can be plugged in, solved individually and thep recombined
to obtain the solution o f the original problem. The general function is as follows:
/(■ ) = Oq + ^ ( a „ c o s ^ ^ + 6„ s in -^^) (1.2.1.1)n=l L L
where ao, a„, K are the Fourier magnitude coefficients o f the corresponding sinusoidal
and co-sinusoidal waves and 2L represents the fundamental frequency given by 2 ;r /T
rad/sec. The Fourier Coefficients can be determined from the fo llow ing integrals:
1 ^ao = — j f ( x )d x ( 1.2.1.2)
2L ,
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
1 r flTDC7i = — J / ( x ) c o s dx M = 1,2,... (1.2.1.3)
-L
M = 1,2,... (1.2.1.4)- L
For non-periodic functions, one can argue that they are periodic w ith an in fin ite period,
which is oo. The Fourier series then becomes Fourier Integral given as follows:
f ( x )= ^[a{w)co^m+b{a)?,mcox]dco (1.2.1.5)0
where,
a{co) = — ^ f { x ) cos caxdx (I.2 .I.6 )
1b{co )- — ^ f{x )sm o }xd x (1.2.1.7)
71
Thus, Fourier series are made up o f sinusoids, all o f which have frequencies that are
integer multiples o f some fundamental frequency. A great thing about using Fourier
series on periodic function is that the first few terms often are a pretty good
approximation to the whole function, not just the region around a special point. Fourier
series are used extensively in many major engineering applications, especially for image
processing and signal processing applications. They are also used in solving ordinary and
partial differential equations (heat conduction, wave theory) and also for various kinds o f
spectroscopy. Finding the coefficients o f a Fourier series is sim ilar to performing the
spectral analysis o f that function.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
1.2.2 Fourier Transform
The Fourier Transform (FT) is basically generalization o f the Fourier series. The Fourier
Transform provides the means o f transforming a continuous time signal into its
corresponding frequency domain. Instead o f sinusoidal and co-sinusoidal terms used in
Fourier series, Fourier Transform uses exponentials and complex numbers. The Fourier
transform X (f) o f a continuous time function x(t) can be expressed as follows:
% ( / ) = j ( 1.2.2.1)
In general, X(f) and x(t) are complex valued functions, 7 being imaginary unity number,
defined as the square root o f - I and 2 z r / being the angular frequency range associated
with the signal. From the defin ition o f the Fourier integral, not every function x(t) has a
transform X(f) [2] [20] [21]. W hile the exact conditions for convergence o f functions are
not known, two conditions that are w idely considered sufficient for convergence
(Bracewell, 1986) are:
Condition I:
The integral o f | / (x ) | from to coexists [2] [20]. That is.
j|/(x )|d !x< o o ( 1.2.2.2)
Condition 2:
Any discontinuities m f(x ) are bounded [2].
Some o f the properties associated w ith Fourier Transform are Linearity, Scaling, Time
shifting. Frequency shifting. Symmetry, Modulation, Differentiation in time and
Convolution. The inverse Fourier Transform that converts frequency domain signal into
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
time domain performs the exact opposite functionality of Fourier Transform and has the
same complexity as the earlier. The Inverse Fourier Transform is defined as
/ ( 0 = — (1.2.2.3)2n
The Fourier Transform is used widely for image analysis, image filtering, and image
reconstruction and image compression.
1.2.3 Discrete Fourier Transform
The Fourier Transform described in the previous section can be compared to an analog
tool as it mainly deals w ith continuous signals and this is evident from the integral used
in equation (1.2.2.2). For the special case in which the sequence to be represented is o f
fin ite duration, it is possible to develop an alternative Fourier representation, referred to
as Discrete Fourier Transform (DFT). DFT is thus the Fourier representation o f finite-
length sequence which is itself a sequence rather than a continuous function, and it
corresponds to samples equally spaced in frequency o f the Fourier Transform o f the
signal [4]. Thus the Discrete Fourier Transform is used in the case where both the time
and frequency variables are discrete. In short. Discrete Fourier Transform (DFT) also
known as Finite Fourier Transform is w idely used to analyze the frequencies contained in
a sampled signal, solve partial differential equations, and to perform other operations
such as convolutions [6] [17] [18] [19]. The two important reasons for using Discrete
Fourier Transform over Fourier Transform are:
• The input and the output o f the DFT are both discrete values making it
convenient for computer manipulations.
• There is an algorithm called Fast Fourier Transform, which is a speedy way o f
computing the Discrete Fourier Transform.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
The Discrete Fourier Transform is defined as:
X{ k ) = y jc(«) , 0 < t < A - l (1.2.3.1)n=0
equation ( 1.2 .3.1) can be rewritten as
N - \
X { k ) = Y .^ {n )W ^ , 0 < k < N - \ (1.2.3.2)n=0
where
(1.2.3.3)
with N being the number o f time samples or number o f frequency samples, x(n) being
input signal amplitude at time n and termed as Twiddle Factor. Thus, calculus is not
needed to define the DFT or its inverse, and w ith fin ite summation lim its, we do not
encounter d ifficu lties w ith infinities. Moreover, in the fie ld o f D ig ita l Signal Processing,
signals and spectra are processed only in sampled form, so that the DFT is what we really
need fo r computational purpose [7]. In simple terms, DFT is computationally less
intensive to compute than the Fourier Transform as can be seen below in this section. A t
the same time, the basic concepts are the same [7]. From equation (1.2.3.1) that for each
value o f k, direct computation o f X(k) involves N complex multiplications (4N real
multiplications) and N -I complex additions (4N-2 real additions) [8]. Consequently, it
2 7takes N complex multiplications and N -N complex additions to compute all N values o f
the DFT [8]. Therefore, roughly 2N^ or O(N^) are required to calculate the DFT o f length-
N sequence [2]. The Inverse Discrete Fourier Transform (IDFT) performs the opposite
functionality that o f the DFT and involves the same complexity and the same number o f
computations as the later. The Inverse Discrete Fourier Transform is defined as:
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
x{n) = — Ÿ , X { k ) W - ' ' \ 0 < k < N - X (1.2.3.4)^ n=0
Thus the Discrete Fourier Transform replaced the Fourier Transform resulting in
computationally capable algorithm, which has found itself applicable in wide range of
digital signal processing and image processing fields. Another aspect is the size o f the
memory required for an A-point DFT calculation. Since, each input term in equation
( 1.2 .3.2) needs to be preserved until the last output term has been computed, the
m inimum memory locations required is 2A [2].
1.2.4 Fast Fourier Transform
Direct computation o f DFT as seen from the previous section, consumes O(N^)
computations for an A-point operation. Though this method o f computation results in the
correct output, the efficiency o f this method when compared to the one to be discussed in
this section is very less. The main reason fo r the inefficiency o f the DFT algorithm is
because it does not explore the Symmetry and Periodicity properties o f the Twiddle
factor . The two properties are defined as:
Symmetry property: = -W ^ (1.2.4.1)
Periodicity property: =W ^ (1.2.4.2)
The Fast Fourier Transform algorithms, utilizes the above two properties thereby
reducing the total number o f computations from O(N^) to 0 (N log2 N) for an A-point
DFT. Due to huge difference in the computational complexity between direct DFT and
that o f FFT algorithm calculations, FFT has rapidly replaced DFT as the pragmatic tool
currently being used in every area o f science and engineering that requires
transformation.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
1.2.4.1 Mathematical Calculation for FFT Computational Complexity
This section introduces Fast Fourier Transform algorithm’s computational advantage
over direct Discrete Fourier Transform calculation. The equations (1.2.3.2) and (1.2.3.3)
are reintroduced for derivation purpose.
N - \
X( k ) = Y ,x in )W lf , 0 < k < N - l (1.2.4.1.1)M=0
(1.2.4.1.2)
For derivation purpose, the value o f N is chosen to be even. Hence N can be represented
in the form N =2 ‘ , where ‘a ’ is a positive integer. Since N is chosen to be even, the entire
sequence o f x(n) can be split into two sequences each o f length N/2 w ith one o f the
sequence containing even components o f jc while the second sequence containing the odd
components. Thus equation (1.2.4.1.1) can be split into two summations each o f length
N/2 and can be rewritten as follows:
N - 2 N - l
% (t) = ^:c(n)W;* + (1.2.4.1.3)odd=\
From [2], i f 2a represents the even components and (2 a + l) the odd components w ith
a = 0 ,1 ,2 ... N/2-1 then equation (1.2.4.1.3) can be written as
% (t)= ^z(2a)W ^'^+ ^%(2a + l)W "'+')* (1.2.4.1.4)(7 = 0
X(^) = '';^x(2a)(W^)'^ +'';^z(2a + l)(W^)'^(W;) (1.2.4.1.5)fl=0 0=0
But W]j can be proved to be equal to . Hence, equation (1.2.4.1.5) becomes
%(t) = + (W ) T L D (^)G ^;% ) (1.2.4.1.6)n—0 0=0
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
with k=0, 1, 2..., N -l. We have stated that the DFT o f a sequence is periodic in its length,
which means that the (A ^ j-p o in t DFTs o f (a) and need to be calculated
for only N/2 o f the N values o f k. For each value o f k, the number o f addition operations
taking place in each o f the two summations is N/2. Flence total number o f addition
operations is 2(N/2) =N. In addition to the N addition operations, there are N operations
o f multiplications by (W^ ) . Hence fo r N values o f k, the total number o f operations is N^
additions and N^ multiplications for a total o f 2N^ operations or can be just represented as
O(N^) as the index value 2 has comparatively lesser value. This represents the
computational complexity when direct DFT method is employed. For large values o f N,
direct DFT computation becomes tedious and not practical to be implemented by digital
systems. For N = 8, equation (1.2.4.1.6) can be diagrammatically represented using
butterfly structures which depict the computation o f X(k) at the output from N values o f
x(n )â t the input.
x(0)
x(2)
x(4)
x(6)
x(1)
x(3)
x(5)
x{7)
N/2-pointDFT
N/2-pointDFT y/=\t :
X(0)
X (1)
X(2)
X(3)
X(4)
X(5)
X(6)
X(7)
Figure 1.1: Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components. The dots represent points wherein addition operations take place. The integers next to the large arrow-marks represent multiplications that take place due to ) [2]
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Using the property o f periodicity, the FFT algorithms calculate DFT o f an iV-point input
by splitting it into even and odd components just as direct DFT is computed, just that
instead o f calculating k value from 0 t i l l N - l, they only calculate k value from 0 until
N/2-1 and the calculated values are re-used for k=N/2, N/2+1, , N - l. Hence, the total
number o f operations get reduced from O(N^) fo r direct DFT computation to 0(N^/2)
resulting in 50% lesser computation. Additional computational reduction can be brought
using the Tw iddle factor ) . Using the property o f Symmetry, W,k+NI2N -VFk, . Hence,
instead o f calculating (W^J) fo r k values from 0 until N - l, it is enough i f we calculate for
values o f k from 0 t i l l N/2-1 and the remaining values o f k w ill be symmetric to the
previous values. As a result, fo r an A-point input, only N/2 values o f tw iddle factors have
to be calculated. Thus figure 1.1 can be redrawn as
x(0)
x(2)
x{4)
x(6)
x(1)x(3)
x(5)
x(7)
N/2-pointDFT \ \ y V * \
N/2-pointDFT
w l *
X(0)
X(1)
X(2)
X(3)
X(4)
X(5)
X(6)
X(7)
Figure 1.2: Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components and using only N/2 twiddle factors. The dots represent points wherein addition operations take place. When compared to the figure 1.1, only four, that is only N/2 number o f tw iddle factors are calculated while the remaining N/2 are just represented as sign due to symmetry property.
10
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
The in itia l process was to divide A-point DFT into two A/2 point DFTs o f even and odd
components. The same process can be repeated thereby breaking down A/2 points further
down to two N/4 points o f even and odd DFTs and the process o f division can be
repeated until, it results in 2-point DFTs, which signals the end o f the splitting process as
calculating a 2-point D FT is very simple. The number o f stages o f such decomposition
for an A-point input is logg A . The FFT decomposition is shown diagrammatically as:
x(0) e
x (2 )e
x(1) • x(3)&
x(5)&
x (7 )e
•X (0 )
•X (7 )
Figure 1.3: Data Flow graph o f an 8-point FFT, showing the entire decomposition until 2- point DFT stage is reached. The stage w ith W \ is the last stage o f decomposition w ith all butterflies being 2-point DFTs.
Each o f the A/2 2-point FFT’ s requires one addition, one subtraction and there are A/2
twiddle factor multiplications per stage. One the whole, every stage requires 0(N )
operations per stage. Since there are logz A stages on the whole, the total number o f
operations for an A-point FFT becomes 0 {N \o g j A ). Thus, when compared to direct
DFT computation that has a operational complexity o f O(N^), FFT algorithms only have
0(Nlog2N) which results in 50% and higher savings in computation making FFT one o f
the fastest and computationally most efficient form o f transformation algorithms
11
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
developed w ith the significance clearly fe lt in applications requiring large values o f N
computation. FFT algorithms have virtually overthrown computation through direct DFT
method and have found usage in a wide spectrum o f fields ranging from sonar/radar
detection, cellular communication, digital signal processing applications, image
processing, designing High Defin ition Television (HDTV), medical imaging to name a
few. Thus FFT algorithms have become an integral part o f many scientific and
engineering applications and hence there is an ever-increasing necessity to design more
efficient FFT algorithms as well as architectures that map these efficient algorithms onto
realizable hardware components.
1.2.5 Power Aware Design
W ith the semiconductor industry well into the deep sub-micron era, lim itations on
physical dimensions have led to unprecedented challenges in terms o f behavioral aspects
o f devices. Greater challenges in terms o f power dissipation, which posed triv ia l
challenges in the earlier stages, are now posing immense constraints on devices being
designed. Heat dissipation from such devices has become a fie ld o f research by itself,
w ith the problem seeming to worsen as the device sizes shrink. The advent o f mobile
computing and communication devices has taken system design to a new high. In
addition to designing and improving circuits for increased computational ability w ith an
increased processing capability, higher operational clock frequency, higher throughput
and increased packaging density, hardware circuits in deep sub-micron era need to be
designed for lower power consumption to improve the life o f battery on which these
mobile devices solely depend, thereby reducing constraints on heat dissipation causing
circuit breakdowns. Power reduction techniques are implemented at every level o f design
12
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
abstraction starting from device modeling all the way until circuit design at the gate and
transistor level and layout. Special techniques have also been developed at the fabrication
level. W ith so much importance for design and implementation o f power aware
architectures and circuits, we take up the task o f designing and simulating architecture
that maps a unique FFT algorithm mainly aimed at memory access reduction, which
could ultimately lead to lesser power consumption at the architectural level.
1.3 Organization o f the Thesis Write-up
The entire report is organized into five chapters w ith the first chapter giving a basic in-
depth into the necessity fo r this work and few o f the technical startups needed to
understand the underlying concepts behind our work.
Chapter 2 gives a comprehensive description about various Fourier algorithms and
architectures based on these algorithms w ith specific focus on Fast Fourier Transform
algorithms. Detailed explanation along w ith supporting mathematical equations has been
provided for clarity purpose. Chapter 2 also gives a comprehensive coverage o f the
Twiddle Factor Based FFT algorithm, which forms the background o f our architectural
design. A clear insight on how the decimation is performed is also given.
Chapter 3 explains in detail the architectural technicalities involved in our design. The
various logic blocks designed and simulated are shown pictoria lly and a good insight on
the computational complexities o f various blocks along w ith equations has been
demonstrated. Memory reduction techniques have also been sighted to enhance the
effectiveness o f our work.
13
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 4 deals w ith simulation output waveforms o f various important blocks that
have been designed and explained in chapter 3. In addition, various plots that depict the
advantage our architecture and the algorithm it is based on possess over other
architectures are plotted. The chapter also summarizes in mathematical units the effective
power savings and memory access reduction achieved.
Chapter 5 concludes our work w ith a brie f summary o f our results and some
suggestions for future designers depicting the scope for improvements that can be
extended to our work to enhance its effectiveness.
14
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 2
FAST FOURIER TRANSFORM ALGO RITHM S AN D ARCHITECTURES
2.1 Introduction
As seen in chapter 1, the Fast Fourier Transform is computationally the most efficient
way o f computing the Fourier transformation. From Section 1.2.3, the calculation o f
direct form o f DFT requires O(N^) and from Section 1.2.4.1, the calculation o f DFT
through FFT algorithm requires only 0(Nlog2N) computations for an A-point input. For
small values o f A, this difference is not significant. But for large values o f A, the FFT is
orders o f magnitude more efficient than direct DFT calculation. Table 2.1 below gives a
comparative analysis o f the computational advantage o f FFT over direct DFT calculation
for various values o f A. A number o f FFT algorithm variants have been designed over the
years each having inherent computational advantages and disadvantages. FFT algorithms
on the broader spectrum are divided into two main types namely Decimation-In-Time
(D IT) and Decimation-In-Frequency (DIF). Either o f the types involves splitting the input
data points into odd and even components, perform DFT operations and then recombine
the calculated values to form the output transformed data points. The D IT form o f FFT
algorithm is formed by splitting x(n) which represents the time domain, into even and odd
components o f N /r data sequence and then continuing the decomposition or splitting
15
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
operation t i l l r-point DFT sequence is reached, where r is specified in terms o f N as
N = r ' ' , w ith k being a positive integer.
Table2.1: Comparative tabulation o f DFT and FFT computational efficiency
Transform length
m
DFT operations FFT operations DFT operations
-Î-FFT operations
16 256 64 4
128 16,400 896 18
1024 1.05 X 10^ 10,240 102
1,048,576 1.1 X 10^^ 2.1 X 10^ 52,429
The total number o f decomposition stages is given by logrN, w ith the logrN stage
having all its computation as r-point DFT operations. Hence i f r=2, then decomposition
of x(n) would proceed until the stage wherein all DFTs operations or so called butterfly
structure have just 2-points each for computation form ing a rad ix-r (w ith r —2 in this
case) D IT-FFT algorithm. On the other hand, the D IF form o f FFT algorithms are formed
by splitting X(k) which represents the frequency domain, into even and odd components
o f N /r data sequence as in case o f D IT explained above. Hence, this type o f algorithm is
termed as rad ix-r D IF FFT algorithm. On a comparative basis, D IT and D IF are
computationally same, thus enabling using either o f the two forms o f algorithms.
16
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
2.2 Decimation-In-Time (D IT ) FFT Algorithm
As we saw in the previous section, Decimation-In-Time FFT algorithm fo r N = / , is
derived by splitting A-point input sequence into N /r equal sequences o f even and odd
components o f the input data. For example, i f r=2, then A-point input data sequence
given by x(n) is divided into two N/2 sequences one containing even and the other
containing odd components o f x(n). The equation for output X(k) is given by equation
(1.2.4.1.6) which is restated below for convenience.
X ( t ) = (a )G V ^,) + ( W ^ ) ^ L o ( « W ; : ^ 2 ) (2.2.1)fl=0 a=0
Equation (2.2.1) can be generalized as
%(t) = f / t ) + w;;, FzW (2.2.2)
Splitting X(k) into two A/2 components results in the fo llow ing equations
FzW, t =0, 7, 2.. . , AC2-7 (2.2.3)
%(t+A/2) = F;(t) - F2(t), t =0, 7,2. . . , 1W2-7 (2.2.4)
Equation (2.2.4) has a negative sign as compared to (2.2.3) because o f the fact
thatW^^'^'^ = . A butterfly is a structure that diagrammatically represents equations
(2.2.3) and (2.2.4). Using butterflies to draw flow graphs simplifies the diagrams and
makes them easier to read.
17
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
« X=A+BW
Y=A-BW
Figure2.1: A radix-2 Decimation-In-Time (D IT) butterfly structure
x(0
x(4
x(2
x(6
x(1
x(5
x(3
x(7
3 ><
z xZ X
• X(4)
Figure2.2: Flow graph o f an 8-point D IT FFT structure using butterflies
2.3 Decimation-In-Frequency (DIF) FFT Algorithm
As in the case o f D IT , Decimation-In-Frequency is obtained by splitting the input
sequence into N /r sequences. I f r=2 (say), then the A-point input x(n) is split into two
sequence each o f N/2 points. Unlike in D IT wherein input data is split into even and odd
terms, the D IF just involves splitting the input sequence into N /r sequence. To derive the
algorithm, we begin by splitting the DFT formula into two summations (since r=2), one
o f which involves the sum over the first N/2 data points and the second sum involves the
last N/2 data points [8]. Thus from [4] [5] [ I I ] [12] we can write
18
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
or
(N/2-1) (V-1)%(Æ)= (2.3.1)
n=0 n=N!2
(A //2 -1 ) (A f/2 -1 ) ^
% ( t ) = E % (»+— )iy ;* (2.3.2)M=0 M=0 ^
Combining the two summations in equation (2.3.2) and using the fact that W,(N/2)kN
= (-1) , we obtain
( A ' / 2-1) AT
% W = %][X») + (-l)*Jc(» + — (2.3.3)n=0 2
Considering the even and odd components o f k and representing them w ith X(2a) and
X(2a+1) respectively, so that
(V /2 -1 ) AT
%(2a)= 2][x(M) + XM + — (2.3.4)n=0 2
{N/2-D AT
% (2 a + i)=M=0 2
a = 0, 7..., (A^-7) (2.3.5)
Thus D IF equations can be generalized for r=2 as
X(k) = F](k) + F 2(k) (2.3.6)
N
19
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
# X=A+B
Y=(A-B)WB e
Figure2.3: A radix-2 Decimation-In-Frequency (DIF) butterfly structure
X(0)
* & - • X(1)
-1 -1
Figure2.4; Flow graph o f an 8-point D IF FFT structure using butterflies
2.4 Decimation Based on Twiddle Factors
In microprocessor-based system, memory access is expensive mainly due to larger
latency and higher power consumption [ I ] [10]. From figure 2.4, it can be seen that the
various twiddle factors used for the DFT operations depicted by the butterfly structure are
repeated at different stages and even w ith in each stage. For example, the twiddle
factor gets used in stage I as well as stage2. Even w ith in stage2, is used twice.
20
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Hence in this 8-point D IF FFT, is seen to be used for a total o f three times. In terms
o f computation, it involves accessing memory thrice to bring in the same twiddle factor
to perform DFT computation. Thus, the redundancy in twiddle factor memory access is
obvious. For larger values o f N, the redundant memory access becomes substantial
leading to higher power consumption. A unique Twiddle-Factor-Based FFT algorithm
[ I ] , is designed to reduce the frequency o f memory access as well as multiplication
operations. The algorithm is mainly divided into two sections based on the Twiddle
factors that are present. The firs t section named as Super Stage (SS), computes the
butterflies involving tw iddle factors ( /# 0) through a computation scheme sim ilar to
Hoffman coding [ I ] . In this section, all butterflies that use the same twiddle factor are
clustered together and computed, thereby having to load that twiddle factor only once to
compute all the butterflies that use it, instead o f accessing the same twiddle factor each
time a butterfly that uses it needs to be computed, resulting in substantial memory access
reduction. In the second section, (N-1 ) butterflies involving the tw iddle factor are
computed using a top-down tree structure. Simulations proved a 20% reduction in clock
cycles and an average o f 30% reduction in memory access for a 32-point FFT using
Twiddle-Factor-Based FFT algorithm when compared w ith the conventional D IF FFT
algorithm [ I ] [10].
Hence, using the Twiddle-factor-Based FFT algorithm, i f a twiddle factor gets loaded
from the memory, it gets utilized until there is no further need for it in any further
computations. This in terms o f number o f memory access is only (N/2-1) for an A-point
input for Twiddle Factor Based FFT as compared to (N-1) fo r conventional FFT
algorithms. The power saving can be significant using the Twiddle-Factor-Based FFT
21
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
algorithm especially fo r large values o f N. One main advantage this algorithm has apart
from having lesser memory access is that the second stage dealing w ith does not
involve any multiplication as W °= l. Hence we also have extra power savings through
non-usage o f m ultipliers that are power hungry and computationally intensive.
2.4.1 Decimation Procedure
In case o f Twiddle-Factor-Based FFT algorithm described in [ I ] [10], the butterflies that
are computed at each stage are spread across log^ N stages o f a conventional
decimation. The Twiddle-factor-Based FFT decomposition is shown in figure 2.5. As can
be seen from figure 2.5, butterflies w ith the same twiddle factor (represented by W [x],
w ith X varying from 0 t i l l (N/2-1) and this representation is analogous toWy^), that were
computed at different stages in a conventional algorithm, now gets computed w ith in a
single stage thereby avoiding the necessity to load the same twiddle factor numerous
times as compared to just once in [ I ] [10]. The decomposition o f the algorithm proceeds
in the fo llow ing fashion. For an A-point FFT, the binary index o f a data sample resembles
(AkAk-i.... Ao), w ith k= log 2N -l.
(1) A t the first stage o f decomposition, all data samples o f the form (AkAk-i.... i j are
computed and any two data samples o f the form (AkAk-i.... 1 ) and (AkAk-i.... 1 )
can pair together to form a butterfly [ I ] .
22
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
gg
hlX{M)
bbiX(9)
liBlLX(l3)
STAGE 1 STAGE 2
jaaX(15)
STAGE 3
SUB-STAGEl SUB-STAGE2 SUB-STAGE3 SUB-STAGE4
STAGE 4
Figure 2.5: Flow graph o f a 16-point FFT structure based on Twiddle-Factor-Based algorithm
The twiddle factor that corresponds to this butterfly is given as where j corresponds
to the decimal value o f the binary sequence given by
(OA1C-1 ....A2A 1 1 ).
(2) A t the second stage o f decomposition, all the data samples w ith binary sequence
(AkAk-].... A 2 10) and (AkAk-i.... A 2A 1 I ) are computed. Any two data samples w ith
binary sequence (AkAk-i.... A 2 IO) and (AkAk-]....A 2 lO ), or (AkAk-i....A2A ] l ) and
{AkAk-i A 2A 1 I ) can pair together to form a butterfly. The tw iddle factor
corresponding to this butterfly is where j corresponds to the decimal value o f
the binary sequence given by (OAt-y .-Az 1 0 ).
23
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
(3) W ith in log2N - l stages o f decomposition, all data samples w ith twiddle factor
other than are calculated.
(4) The stage involves butterflies whose corresponding tw iddle factor value is
(000... 1).
The obvious advantage that can be seen from the above form o f FFT decomposition is the
number o f memory access that is made w ith regards to accessing the various twiddle
factors. Thus we only need (N/2-1) memory access as against (N-1) required by
conventional algorithms [1] [10]. Another interesting aspect in addition to reducing the
memory access is that the log2N‘ stage o f decomposition involves as its twiddle
factor and =1. Thus there is no m ultiplication involved w ith butterflies using this
tw iddle factor and hence the fina l stage decomposition involving (N-1) butterflies does
not involve any multiplication thereby helping us save valuable m ultiplication operations.
Thus, the Twiddle factor based FFT algorithm seems to be a more appropriate
algorithm that caters to the need for power aware FFT systems.
24
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 3
AR C H ITEC TU R AL DESIGN AN D IM PLEM EN TATIO N
3.1 Overview
In chapter 2, we had a comprehensive explanation about the Twiddle-Factor-Based FFT
algorithm and also the computational advantages it had over other conventional FFT
algorithms was explic itly shown. Designing architecture to map this computationally
challenging but intensely less memory access-involving algorithm is the main aspect o f
our work.
Until the early to mid-90s, low power electronics were for the major part considered
only fo r a very few applications largely comprising o f small personal battery-powered
devices. But the m id 90s saw the remarkable development o f CMOS sub-micron
technology, which subsequently led to the advent o f deep sub-micron CMOS era.
Another radical change that revolutionized the electronics market was the unprecedented
demand and subsequent development o f portable communication and computational
devices that m ainly depended on battery power for their operation. Unfortunately, the
battery industry could not keep pace w ith the developments in the semiconductor
industry. As a result, high-end electronic portable devices needed constant battery
recharging, which made them less user friendly. This resulted in extensive research done
towards design and implementation o f newer V LS I algorithms, architectures and circuit
25
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
techniques that would utilize the minimum possible power w ithout sacrificing any o f the
other parameters such as bandwidth, clock speed, area and throughput.
Our architectural design, based on a power reduced FFT algorithm is a small step in
this direction.
3.2 Design Target
The FFT algorithms have a wide range o f signal processing and communication
transmission applications. In recent times, one important area where FFT has found
extensive application is H igh Defin ition Television (HDTV). According to the standard
o f European D igita l Broadcasting, FFT/IFFT must execute 8192 points in 896
microseconds. In addition, we also target our architecture towards Orthogonal Frequency
D ivision M ultip lexing (OFDM) transceiver whose IEEE 802.1 Ig standard requires it to
execute 7024-point FFT in 51 microseconds.
3.3 A lgorithm Setup
We recall the Twiddle Factor Based FFT decomposition structure shown in figure 2.5.
From the structure, the algorithm can be broadly classified into three distinct divisions
namely Input, Processing and Output stages.
26
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
x(I4ix(ia
W2)
JBuX(10)
jm iX (14 )
gg
imX(13)
dmX(ll)
jmilX(15)
STAGE 1 STAGE2 STAGES
SUB-STAGEl SUB-STAGE2 SUB-STAGE3 SUB-STAGE4
STAGE 4
Figure 3.1: F low graph o f a 16-point FFT structure based on Twiddle-Factor-Based algorithm
A t the Input stage, complex form data values corresponding to the N unprocessed, in
other words time domain signals x(n) where n=0,l,2 ,... ,(N-1) are input into the memory
device from external source (generally other blocks whose output need FFT processing).
The address location where the input data values would get stored depends on values o f
n. That is, i f data value corresponding to x(n) needs to be stored, it gets stored at a
location whose address is given by binary equivalent o f (n+ 1 ) , that is i f data
corresponding to x(0 ) needs to be stored, it gets stored in a memory location whose
address is binary equivalent o f (n+1 ) or (0+1 =1 ). S im ilarly for x(4), the address is binary
equivalent o f (4+1=5). The number o f binary bits to represent memory address depends
on the size o f the memory device as shall be seen in the coming sections. Thus, data is
fed into memory starting from x(0) all the way up to x(N -l), in subsequent locations.
27
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
The proeessing division is where the DFT butterfly operation takes place. This division is
equivalent to the Central Processing U nit o f any computer system. This block transforms
input time domain signals x(n) into its corresponding frequency domain signal X(k).
The transformed X(k) output values are written back to the same locations in the
memory device from where they were in itia lly accessed. Thus after all operations, the
same memory locations which contained time domain signals would now contain
frequency domain signal output. The output division accomplishes this process. For any
given N, the total number o f decomposition stages is given as log,. A , where A = / , k
being any positive integer. Hence the above stated process is repeated log,. A times
resulting in the final values o f frequency domain X(k) signals.
3.4 Random Access Memory (R AM ) Address Generation Block Design
The functionality o f the R A M address generator is to determine
(1) Total number o f stages o f decomposition
(2) Total number o f DFT butterfly operational groups w ith in each stage based on
twiddle factors
(3) Determine the memory address locations o f the data needed for each o f the DFT
butterfly operation.
The R A M address generator is one o f the few complex operational blocks designed in our
architecture. Based on our observation o f the Twiddle-Factor-Based decimation FFT
algorithm, we can make the fo llow ing conclusions assuming A=2* k is any positive
integer.
(1) Total number o f stages o f decomposition fo r an A-point input = /ogzA
28
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
(2) Only the butterflies in decomposition stages 1 t i l l (log2N-2 ) involve complex
multiplications.
1 3 5 (NIT, - 1 \(3) Butterflies w ith twiddle factors form the (log 2N - l ) ‘ stage
,(LOGN - 2)
and involves only multiplication w ith imaginary term j which can be effectively
implemented by switching or swapping real and imaginary terms o f the output
and inverting the sign term between them, thus avoiding use o f complex
m ultiplier.
For the purpose o f R A M address generation, we came up w ith a pseudo code in C++
which could generate addresses o f all butterfly DFTs based on Twiddle-Factor-Based
FFT algorithm decomposition described in [1] [10]. The Pseudo code consists o f two
parts w ith the first describing the decomposition for the first (ZogzA-i) stages and the
second describing the (log2N ) ^ stage. The first Pseudo code depicting address generation
o f the first (ZogzA-i) stages consists o f four nested loops.
(1) Loop 1: corresponds to the current stage number (between 1 and (logzN -l))
(2) Loop 2: corresponds to the total number o f different twiddle factors that are used
with in a given stage
(3) Loop 3: fo r a given twiddle factor, we have butterfly DFT structures that vary in size.
Loop3 computes total number o f different sizes o f butterflies that utilize a given
twiddle factor.
(4) Loop 4: Computer total number o f butterflies for a given size and a tw iddle factor
29
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.4.1 Formulae o f Computational Complexity for the Loop Structures
fo r (log2N - l) Stages o f Decomposition
Loop 1 Computation - The loop 1 as seen from section 3.4, computes the total number o f
stages o f decomposition excluding the last stage, which is generally computed separately
as this stage does not involve any tw iddle factor multiplication. Excluding the last stage
o f decomposition, the total number o f stages is (log2N - l) and hence loop l index varies
from 1 to (log2N -l) .
Loop 2 Computation - The generalized formula for the calculation o f Tw iddle factors
that are used in each o f the (log2N - l) stages, where N = / , k being any positive integer is
given as
W \ , where jc= i, 3, 5, 7... [ (N /r^ ') - l] (3.4.1.1)
where cs represents the current stage number
The current stage value (cs) varies from 1 to (log2N -l) . For our architectural design we
assumed r=2. Hence, (3.4.1.1) becomes
W \ , where x=7, 3, J, 7... /(ZW2 7 (3.4.1.2)
= , where x=7, 3, 5, 7... 7(ZW2 ")-77 (3.4.1.3)
The total number o f butterflies in each stage is given as
( 2 " -1 )A)(CS+1)
(3.4.1.4)
Equation (3.4.1.4) gives the total number o f butterfly operations per stage. Thus, for
(log2N - l) stages, the total number o f butterflies is given as
30
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
log2 N - \ JtJ
I (3 4.1.5)CS=1 ^
Using equation (3.4.1.3) and the values N=16 and r=2, we can work out an example to
calculate the tw iddle factors used at various stages o f decomposition using the Twiddle-
Factor-based FFT algorithm. Number o f stages w ith tw iddle factor multiplications is
given as (log2N - l) = (log2 l 6 - l ) =3. Thus the stage index value or CS varies from 1 to 3.
Stage#l:
Tw iddle factors accessed when C 5= l is given as
" , wherez=7, j , 5, 7... /(A /2")-77
= , where x = l, 3, 5 and 7
Therefore the tw iddle factors that are used when C5'=l areW/g W,g Wjg and W j
Stage#2:
Twiddle factors accessed when CS=2 is given as
, where z=7, J, 5, 7... 7(7W2")-77
= ' , where x = la n d 3
Therefore the tw iddle factors that are used when CS=2 are W l and lT,g
Twiddle factors accessed when C5=3 is given as
w;" ' , where x=7, J, 5, 7... 7(A/2")-77
= Vkif' where x = i
Therefore the tw iddle factors that are used when C 5=l is IT,g
Loop 3 Computation - For a given tw iddle factor w ith in any given stage, there can be
DFT butterfly structures o f varying sizes, where size o f a butterfly w ith reference to
31
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
conventional FFT algorithm can be said to be (N/2^^) where CS corresponds to current
index value o f stage number which varies from 1 to log2N fo r a given value o f A. In case
o f conventional FFT algorithms, all butterflies that were computed w ith in a single stage
were o f the same size. On the other hand, Twiddle-factor-Based FFT algorithm,
computes butterflies based on their tw iddle values. Hence, butterflies that are computed
w ith in a single stage need not necessarily contain butterflies o f the same size. The total
numbers o f different sized butterfly structures that utilize a single twiddle factor at any
given stage o f decomposition equals the value o f the stage o f decomposition where the
butterfly is currently present. For example, IT,g twiddle factor that is present in stage 1,
would have butterflies o f only one size using that twiddle factor as it lies in the first stage.
Similarly, IT,g which lies in stage 3, would have butterflies o f three different sizes. The
different sizes might also be specified in terms o f levels, w ith each level being occupied
by butterflies o f one particular size. Hence three different sizes for a single tw iddle factor
means there are three levels fo r that tw iddle factor. The size o f butterflies varies starting
from A/2 for the first level and successive steps having half the size o f the butterflies in
the previous level. For example, we saw that IT,g had three different sizes o f butterflies.
Hence,
Size o f butterflies present in level 1 = A/2
Size o f butterflies present in level 2 = V2(N/2 ) = N/4
Size o f butterflies present in level 1 = ViiNM) = N / 8
Loop 4 Computation - Having formulated the number o f stages o f decomposition, total
numbers o f twiddle factors per stage and the total number o f levels o f butterflies per
32
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
twiddle factor. We are le ft w ith calculating the total number o f butterflies to be computed
in each o f the levels. The formula for calculating the total number o f butterflies present
per level is given as For example, again considering tw iddle factorW,g, we can
thus calculate the number o f butterflies in each o f the three levels. Thus
Total number o f butterflies present in level 1 = = 1
Total number o f butterflies present in level 2 = = 2
Total number o f butterflies present in level 3 = 2 ' = 4
An important aspect o f the butterfly decimation is that w ith in a given level, the butterflies
can be computed in any order. S im ilarly for a given twiddle factor w ith in a stage, the
various levels can be computed in any order and so can be done w ith different twiddle
factors w ith in a given stage. Thus the parallelism can be exploited at various levels o f
decomposition. This feature makes the Fast Fourier Transform the most sought after
transformation algorithm in numerous signal processing and communication applications.
3.4.2 Pseudo Code Implementation
Based on the four loop structure calculated fo r the first (log2N - l) decomposition stages
based on Twiddle-Factor-Based FFT algorithm, a C-like pseudo code to implement the
DFT butterfly operations over (log2N - l) stages is shown below in figure (3.2).
bf_size = N/2 LOOP 1:i = 1 to (log2N -l) , i+ + a7 = (2 ^ (W ))
LOOP 2.
bf_size = N/2 a = a la = a + 2 ( j - l ) * a l
33
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
LOOP 3:K = (i-1) to 0, k- -
LOOP 4:m = 1 to (N/(bf_size*2)) b = a + bf_size c[m ] = aa = a + (2 *bf_size)D FT Butterfly operation to be performed
LOOP 4 ENDS a = c [ l ] / 2
bf_size = (bf_size/2 )
LOOP 3 ENDS LOOP 2 ENDS LOOP 1 ENDS
Figure 3.2: Pseudo code for the (log 2N - l) stages decomposition
3.4.3 Architectural Blocks Design and Implementation
For the pseudo code designed in figure (3.2), a logic design at the behavioral level is done
using Very large-scale integrated circuits Hardware Description Language (V H D L) tool.
Aldec Inc., license version o f Active H D L version 6.3 was used entirely to design and
simulate the various architectural blocks. The architectural implementation o f the (log2N-
1) stages o f decomposition based on Twiddle factors is shown in figure (3.3). On
comparing the pseudo code w ith the architectural blocks, we can find that the block
i j j lo c k corresponds to loop l, v^hWe j_b lock and k jb lock correspond to loop2 and loop3
respectively. Due to logical implementation complexity, loop4 is implemented using
three blocks namely m jblock, m ljb lo c k and m2_block. The main functionality o f the
R A M address generator as stated earlier is to generate address o f data elements present in
34
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
the memory unit that are needed for all the D FT butterfly operations over the entire
decomposition stages. For every butterfly operation, two data elements are required,
which implies that two addresses are required to be generated by the R A M address
generator fo r every butterfly operations that gets computed. From the architectural
design, qa and qb represent the two binary address values that get generated. Each o f the
two outputs is a binary sequence o f w idth determined by the size o f the memory unit it
accesses as can be seen in the upcoming sections. A ll the blocks are o f non-pipelined
nature wherein only one block remains operational at any instant. The blocks gets
initiated one after the other in the order required automatically by triggering signals
which present w ith in one block trigger the next after the completion o f execution by
current block. There are six blocks in total implementing the four nested fo r loops. The
value o f qa is generated at multiple locations along the six blocks. The correct value o f qa
that corresponds to any particular butterfly operation is chosen among the different
values based on a resolution function block. The alphabet V represents a signal that
toggles fo r every value o f loop l index and in turn triggers j jb lo c k which then gets
executed loop2 index times and V ’ represents the situation wherein jjb lo c k completes
executing loop2 index times and hence control is transferred back to ijb lo c k fo r further
processing. S im ilarly, W and W’ represent the computational triggering signal between
j_b lock and k_block. On the same ground, we can explain the functionality o f X, X ’, Y, Z
and Z ’. The complete V H D L implementation o f the architecture described in figure (3.3)
is shown in A P P E N D K . The other important signals that form a part o f our design are
the clear, enable and clock. The clock is basically a synchronizing signal whose complete
functionality is explained in section 4.7. The clear signal is used as an erasing signal
35
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
which when active high or 1 erases all the values at the output ports and assigns them to
high impedance.
dock
dearenable k blockenable
To e n a b le (logN ) s la g e b lo ck s W"
loop3dockdear q ak
q a
m2 block ml blockq a m R eso lu tion
block—►
Figure 3.3; B lock design representation o f (log2N - l) stages based on Twiddle factor decomposition algorithm
The enable signal is basically to determine i f a block needs to be operational at any given
instant. Any block proceeds w ith its execution only i f enable signal remains active low or
0 , fa iling which, the previous output o f that block is retained irrespective o f changes in its
input.
36
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.5 Formulae o f Computational Complexity fo r the Loop Structures
for log2]Sf Stage o f Decomposition
In section 3.4, we saw the R A M Address generation block design fo r the first (log2N -l)
stages o f decomposition based on tw iddle factors. The final stage or log2N‘ stage o f
decomposition is probably unique when compared to the other stages mainly in terms of
usage o f multipliers. The log2N ’ stage is classified in such a way that all the butterflies in
this stage irrespective o f their levels or sizes use only the tw iddle f a c t o r . The value of
is always equal to 1, irrespective o f the value o f N. DFT butterfly calculations based
on equations (2.3.6) and (2.3.7) reveal that using is equivalent to m ultiplying
equation (2.3.7) by 1. The tw iddle factors used for multiplication in the first (log2N -l)
stages are complex in nature (containing real and imaginary terms). Hence when these
complex twiddles get into equation (2.3.7), they result in complex multiplications, which
are computationally complex and intensive to design and implement. Hence the absence
o f complex multiplications in the (log2N"') stage makes this stage computationally less
intensive and results in large savings in terms o f clock cycles for computation and also in
terms o f power dissipation. The total number o f butterflies that are computed in this stage
is given as (N-1 ). Since there is no multiplication by twiddle factor involved in this stage,
we can disable all the multipliers that are designed to handle multiplications. This
procedure is more complex in other conventional FFT algorithms. Though W° factor is
present even in other FFT algorithms, they get utilized at different stages o f
decomposition, unlike in Twiddle factor based algorithm [1] wherein gets utilized
only in one stage. Hence in conventional algorithms, disabling the multipliers at different
time intervals is more d ifficu lt. This is one obvious advantage Twiddle-factor-based FFT
37
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
algorithm has over other FFT algorithms. As a result, we can disable the multipliers for
clock cycles equivalent to implementing (N-1 ) complex multiplications. This results in
power savings as digital circuits like memory units and multipliers are power hungry.
Though power efficient m ultipliers are being designed, it is more suitable to reduce the
usage o f multipliers.
3.5.1 Memory Reduction Technique
The Twiddle-Factor-Based FFT algorithm [1] [10], describes a methodology by which
memory access can be reduced even at the (log2N) stage. From figure 3.1, butterflies
w ith in the (log2N) stage is decomposed into (log2N) further sub-stages as depicted in
figure 3.4. From figure 3.4 it can be seen that in sub-stage 1 (S I), the two outputs o f
butterfly are represented as A and B. The output value A serves as input for C and that o f
B used as input for D present in sub-stage2 (S2), while input values for points E and F do
not depend on their previous stages. Hence input to these points has to be accessed
directly from memory units. It becomes redundant i f A and B get stored in the relatively
slow memory units and then C and D accessing those values back from the same memory
unit. Instead, A and B can be stored in temporary registers from where C and D can
access them.
38
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
JCC12)
XC13)
%(3)«11)«7)«13)
S U B -S T A G E SI S U B -S T A G E S2 S U B -S T A G E S S U B -S T A G E 4
S T A G E 4
Figure 3.4: (log2N f^ stage decomposition structure fo r # = 16
W hile one o f the inputs to every butterfly can thus be accessed from temporary registers
where the previous stage outputs get stored, the second input to the butterflies is only
accessed from the main memory units. As a result o f this, main memory accessing gets
reduced by as much as 50%. Accessing data from main memory units is more power
consuming than accessing them from temporary registers simply because, temporary
registers can be built according to user needs and can be placed closer to the processing
unit. Moreover the main memory structure is designed in a more complex way when
compared to the temporary register bank. The decoding for temporary register location is
also simple when compared to the main memory. Based on our evaluation, the total
number o f temporary registers required for an A-point input is N/2.
39
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.5.2 Pseudo Code Implementation
A C-like pseudo code fo r implementation o f (log2N) stage is given in figure (3.5). It is a
two for loop structure generating the addresses qa and qh that represent the binary
equivalent o f the addresses o f the two memory locations from where data fo r computing
the butterfly is accessed. This pseudo code only deals w ith address generation and not
w ith memory reduction concept.
bf_size = N/2 LOOP 1:k = 1 to (log2N), k+ + a l = 0 ; a= 0
LOOP 2:m = 1 to (N/(bf_size*2)), m ++ a = a + a l b = a + bf_size c = a; d = b; c = (a + b) d = (a — b) a l = (bf__size) * 2
LOOP 2 ENDS bf_size = (bf_size/2 )
LOOP 1 ENDS
Figure 3.5: Pseudo code for the (log2N - l) stages decomposition
3.5.3 Architectural Blocks Design and Implementation
The two-loop pseudo code is logically designed using V H D L at the behavioral level. The
blocks are designed in such a way that o f the two data values that are required for
40
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
computation o f every butterfly, one data comes from the temporary register while the
other comes from main memory unit. On one hand, the outputs from the butterflies that
are computed in the (log2N - l) sub-stages are stored back into temporary registers while
on the other hand, the outputs o f butterflies computed in the (logzN/^ sub-stage get stored
into the main memory unit rather than in the temporary registers mainly because, the
(log2N f^ sub-stage forms the final stage o f computation and hence it forms the output
stage and the output values need to be written back onto main memory from where they
would be accessed by other systems. Figure (3.6) shows the architectural blocks o f the
(log2Nf'^ stage implementation. As explained in section 3.4.3, the functionality o f A, A ’,
B, B ’, C, C ’ , D and D ’, is to trigger the successive blocks to which they are respectively
connected. The clear, enable and clock signals perform the same functionality as that
explained in section 3.4.3. The q_temp_reg_addr is the address corresponding to the
temporary register location from wherein data needs to be accessed. The q b jo g n
represents the second address needed fo r performing the butterfly operation, which is
fetched from R AM . The V H D L block implementation o f the architecture explained in
figure (3.6) is given in APPENDIX. On the design front, we plan on designing a 1024-
point FFT architecture targeting H D TV application. For this purpose, we need 512
temporary registers each o f 32-bits wide. The 512, 32-bit temporary registers are
implemented as a tree structure. In itia lly , one 32-bit temporary register is designed. It is
then duplicated or in V H D L terms port mapped to form the second temporary register.
41
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
clockc le a r
d ockc le a r
clockclear
en ab leq b jo g n 1 q je m p _ r e g _ s d q n
From (logN-1) s ta g e s block
q je m p _ re g _ a d d r
clockclear
clockclear
iogn_block5enatde en ab le
q b jo g n 2 q je m p _ re g _ a d d r2
logn_block3
lo g n _ b lo c k 4
q a j o g n
Figure 3.6: V H D L Block implementation o f (log2N f stage based on Twiddle factor decomposition algorithm
Now, the two identical registers are considered to be a single entity and this is then port-
mapped to form two more identical registers, resulting in four registers. Proceeding this
way, we get to design 512, 32-bit temporary register bank. During the different stages o f
algorithm execution, data is constantly written onto and read out from the temporary
registers. Hence, there is a necessity to generate the address o f the temporary register
where data is to be currently written or to be read. The logic block implemented as shown
in figure 3.6 generates the address corresponding to the right temporary register. This
address is decoded in stages by a decoder. Each stage o f temporary register design has its
own decoder, which decided the exact location o f the temporary register.
The number o f address bits needed to specify a location in the temporary register bank
is determined using the fo llow ing expression
Number o f address bits = log2 (total number o f register locations) (3.5.3.1)
Hence, in case o f temporary register bank, we have N/2 or 512 register locations.
Substituting in equation (3.5.3.1), the number o f address bits is 9. Thus we need 9-bit
42
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
binary sequence to represent each o f the temporary register locations. In the previous
section, we saw that in the (log2N f^ stage, data is accessed both from main memory unit
as well as from temporary registers. To achieve this, the design incorporates multiplexers
to choose between the two data source. Each temporary register has a Read and Write
signal. I f the Read signal is set to high or 1, data is read out o f temporary register, while
making Write signal high on the other hand, enables us to write data into the registers.
The complete block architecture for address generation fo r both (log2N - l) stages and
(log2N f ' stage is shown in figure (3.7).
LfelotSc I *mrnbW k Wof*
To (logN)Wqe Wock* " 1 ^ ' V'
Ioop1 loops
H l.,v
......1.....
r . * * » mWock
X'
qak
qam R w o W o r
qbJogN-1
clock dockcW r
k>gfi bfodf.1 A ...► kgn.M ocia
en»Meg J b g n .b b c K l
LO'
LRwoAAor
Wochfog #0*2
qb
Address to temp register
W logn
Figure 3.7: B lock implementation o f (log2N - l) stages and (log2N f^ stage based onTwiddle factor decomposition algorithm
43
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.6 Random Access M emory Design
The Random Access Memory (RAM ) is the main memory system designed and
simulated in our architecture structure. Our FFT architecture is designed for 1024 points
input system. Hence we have 7024-points time domain input signals that need
transformation. For this purpose, the input data needs to be stored in a memory unit from
where they can be accessed whenever necessary. The design and implementation o f the
Random Access Memory (R AM ) tends to serve this purpose.
The biggest advantage that FFT algorithms possesses that other transform algorithms
do not is that o f In-place computation. The FFT algorithms as seen get computed in
stages. In conventional FFT algorithms, input data as well as output data are accesses in
and out o f R AM , which serves as the primary memory unit. The input data that is used in
the computation at each stage is only needed fo r that stage. Once the output to that stage
gets generated, the inputs that this stage used are no more needed and can be replaced by
the output data obtained. This process o f replacing or re-using the same set o f memory
locations that stored the input data to store output data values is termed as In-place
computation. This concept results in m inimum requirement and usage of memory unit
size, which makes it optimal for digital systems. Hence, for an A-point FFT
implementation.
The number o f M ain memory (R AM ) locations needed = N (3.6.1)
I f each memory location is represented by the term word, then, for a 7024-point FFT
implementation, we need 1024 words or locations in the R A M to store all the input and
output values. The address locations for the memory locations start from binary
4 4
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
equivalent o f 0 and continue all the way up to 1023, thus specifying all the 1024
locations. Each location can store data o f width 52-bits. From equation (3.5.3.1), the
number o f bits requires to address each location can be calculated to be 10. Hence, the
address sequence starts from 0 0 0 0 0 0 0 0 0 0 all the way until 1 1 1 1 1 1 1 1 1 1 w ith increment
o f 1. Two distinct but sim ilar R A M blocks having the same set o f address sequence is
designed to store both the real and imaginary part o f the complex data input o f the form
(a+ jb) w ith a and b being any real number. Thus, accessing the exact memory locations
from both the blocks simultaneously results in accessing both the real and imaginary
parts o f an input data. The basic structure o f a 7-word R A M is shown in figure (3.8). This
structure is common fo r both the real and imaginary parts o f R A M design block.
clock
Enable from decoder
Data to be written — —
Signal to read out data
Signal to write in data
Data to be read
Figure 3.8: B lock level representation o f 1-word R A M cell fo r storing 52-bit data
From figure (3.8), the basic building block o f a R A M word consists o f a Read, Write,
Clear and enable from decoder as its signals. The Read and Write signals enable us to
input data into and output data respectively from any memory location. The Clear signal
i f high or 7 erases the contents o f the memory location and replaces the output as well as
the content o f the memory to high impedance. Only i f the enable from decoder is made
45
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
active low or 0, w ill it be possible to access data from or into the location. Otherwise,
whatever was the previous output continues to remain at the output port and no changes
get reflected. The structural design methodology followed for R A M is sim ilar to the way
the temporary register bank was designed in section 3.5.3. The V H D L implementation o f
the R A M cell is shown in APPENDIX.
In order to access the exact location in the R A M block, we need to decode the address
b it sequence generated by the R A M address generator. The decoder block implements
this procedure. The decoder is basically a de multiplexer, which chooses one among its
many outputs based on the value o f the input. The basic structure o f a decoder is shown
in figure (3.9).
clock
Decoder enable
2-bli address to choose between 4 outputs
RAM
decoder
output 1
o utgu t2
o u tgu t3
outgut4
Figure 3.9: B lock representation o f a R A M decoder
In our design, we implement one decoder to decode 2 address bits thus enabling us to
locate four R A M word cells. In other words, one decoder can help us operate four
different memory locations. Thus for a 7024-points R A M cell locations, we need 341
decoders in total. This can be made clear by figure (3.10) and the explanation given
below. We know that we need 7 decoder to decode 4 R A M cells and are shown below.
Let this combination be termed as a group. Hence, a group contains 4 RAM cells and 7
4 6
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
decoder. Based on the previous calculation, i f we have four separate groups, each having 4
RAM cells and 1 decoder as in figure (3.10a), then we got to a total o f 16 RAM cells and 4
decoders decoding these 16 RAM cells.
CELL 4CELLSCELL1 CELL 2
DECODER 1
Figure 3.10a: Basic group formation o f R A M cells using decoders
This is depicted in figure (3.10b). To choose one group among the four available, we need a
decoder. Thus, for a total o f 16 RAM cells decoding, we get to use 5 decoders. Let us term
this as high group.
ij I cai7[ j cEu.*! CEU.2 CELL3 CEU.
DECODER
Figure 3.10b: Hierarchical group formation o f R A M cells using decoders
47
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Proceeding on a sim ilar ground, we can see that for four such high groups to get decoded,
we need 20 decoders, thus decoding 64 R A M cells. In order to one among the four high
groups, we need a decoder. Hence fo r decoding 64 R A M cells, we need a total o f 21
decoders. The calculation can thus be extended to higher levels o f grouping and thus for a
1024 cells R A M structure, we would need a total o f 341 decoders. The V H D L block
level design o f such a hierarchical R A M and decoder stmcture fo r 64-words R A M is
shown in APPENDIX. The disadvantage o f such huge memory units is obvious from our
previous discussions. The humongous hardware requirements hamper the performance of
such memory units in terms o f operational speed, data retrieval and storage time, and area
and power consumption. Hence in recent times, high-end research is devoted to designing
and fabricating huge memory devices w ith minimum hardware. As a result, high-speed
memories such as cache and flash memories have been designed so as to reduce the
operational delays and also power consumption. These high-end memories are designed
to be very small and hence are area efficient as well. A ll these benefits have created a
tremendous scope for such memories and are widely being utilized in many system
designs especially mobile devices where area and power savings are o f primary
importance.
3.6.1 Read Only Memory (ROM)
In the previous section, we saw the design and implementation o f Random Access
Memory (RAM ). The R A M is a memory unit wherein data can be written, read and
erased. Thus R A M can be termed as an erasable memory. On the other hand, ROM is a
Read only option memory wherein data to be stored is pre-determined during its
fabrication and is hardwired. Thus the data once hardwired cannot be changed and as a
48
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
result, no new data can be written on to a ROM cell. Hence there is only a Read option
and no Write option. The RO M is thus used only i f there are stored values that do not
change during the course o f execution. In any FFT algorithm, the tw iddle factors , k
=0, 1... ((N /2 )-l) do not change their values once computed. Hence for a given A-point
FFT, the entire N/2 number o f complex tw iddle factors is stored in the ROM, whose size
is determined as N/2. Thus fo r our 1024-point architecture, we need 512 locations in the
ROM to store the various tw iddle factor values w ith each location being 52-bits wide.
Since, there are 512 R O M locations; we need 10-hit binary sequence to address every
ROM location. The address to the RO M blocks are generated by ROM address
generation block, which keeps generating consecutive address locations o f twiddle
factors when ever it gets triggered by a signal from the R A M address generation block.
Every time the second fo r loop in the R A M address generation block gets executed, a
new twiddle factor needs to be accessed from ROM. Hence, a trigger signal is sent to the
ROM address generation block indicating that address for the new tw iddle factor be
generated by it and sent to the ROM blocks which w ill then output the tw iddle factor
values to the location o f the butterfly operation. Figure (3.11) below shows the ROM
blocks designed in our architecture. V H D L block representation o f RO M is shown in
APPENDIX.
49
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
dock p,
dear ^
ROM con toiler enable
Signal to initiate ROM from RAM address genaratof blocks
KUIV1 access address
ROMContrôler
clock ^ dear ^
|p.
512-wordsROM
Triggeringsignal to output
new twiddle factor
Twiddle factor data t>-
Figure 3.11: Block representation o f Read Only Memory (ROM) and its controller
3.7 Data-Path Design
W ith the address generation block, Random Access Memory and Read Only Memory
design complete, the architecture can now generate the addresses o f the locations from
where time domain signals stored in the Random Access Memory and twiddle factors
from the Read Only Memory can be accessed. Once the time domain signal and
corresponding twiddle factor data from the memory units get accessed, they must be
transformed into frequency domain output signal. In other words, DPT butterfly
operation needs to be performed on the accessed input data. For this purpose, we
designed the data-path block. It is in this block that arithmetic operations such as
addition, subtraction and m ultiplication are performed.
The formulation for each o f the DFT butterfly operations in the Twiddle-Factor-Based
FFT algorithm [1] is based on equations (2.3.6) and (2.3.7) and may be recalled for
clarity.
%(ik) = F # ) + F2W (3.7.1)
50
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
k= 0 , l,2 . . . ( N /2 - l ) (3.7.2)
Equations (3.7.1) and (3.7.2) involve one addition, one subtraction and one
multiplication. The (log 2N) stages Twiddle-Factor FFT algorithm [1] can be divided into
three distinct groups based on twiddle factors.
Group 1: consists o f butterflies from stages 1 t i l l (log2N-2 ). The tw iddle factors that are
utilized in any o f these stages are complex in nature, which is o f the form (a+ jb) w ith a, b
being any real number. The input data F j (k) and F 2 (k) stored in the memory units are
also complex in nature as the twiddle factors itself. Hence, when equations (3.7.1) and
(3.7.2) get computed, we need to perform one addition, one subtraction and one complex
multiplication which are computationally more intensive than a real multiplication.
Group2: consists o f butterflies in the (log2N -l)^ ‘ stage. The twiddle factor that gets
utilized in this stage is o f value 7= . Hence, when equations (3.7.1) and (3.7.2) get
computed, we need to perform one addition, one subtraction and one triv ia l multiplication
w ith just the imaginary term j . We term this multiplication triv ia l because m ultiplication
w ith j can be easily performed by just swapping the input real and imaginary terms and
invert the sign between the two terms. Thus, there are no m ultiplication operations
actually involved. Hence all multipliers designed can be disabled.
Group3: consists o f butterflies in the (log2N) stage. The twiddle factor that gets utilized
in this stage is whose value is always 1 irrespective o f the value o f N. Hence, when
equations (3.7.1) and (3.7.2) get computed, we need to perform just one addition and one
51
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
subtraction and there are no multiplications involved in this group. Hence all multipliers
designed can be disabled.
From the above three groups, we can find the addition and subtraction operations to be
common, while in case o f m ultiplication, only group 1 involves complex multiplication
and the other two groups do not involve any multiplication. Hence, exception for addition
and subtraction operations, we need three separate methods, by which we can choose
between performing complex multiplications, swapping operations and no
multiplications.
3.7.1 Complex M u ltip lie r Implementation
As seen from section 3.7, butterflies computed in group 1 have complex multiplications.
Each o f the butterfly operation thus involves one complex multiplication. Complex
multiplications are more complicated when compared to real multiplications as they
contain two terms real and imaginary. Hence complex multiplication involves more
computations than an ordinary real multiplication. There are two methods for
implementing complex multiplications in digital system. From equation (3.7.2) complex
multiplication takes place between F 2 (k) a n d W j, where k=0, 1..., ((N /2)-l). I f complex
F 2 (k) is considered to be o f the form (a+ jb) and the complex term is considered to be
o f the form (c+ jd) w ith a, b, c, d be any real numbers. The complex m ultiplication now
becomes
(3.7.1.1)
= ac+jad+jbc-bd (3.7.1.2)
52
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Equation (3.7.1.2) can be implemented in two methods.
Method 1: This method is a straightforward implementation o f equation (3.7.1.2).
Equation (3.7.1.2) has two addition, one subtraction and four real m ultiplication
operations. The terms ac and bd represent the real term o f complex multiplication while
ad and be represent the imaginary part. Pictorial representation o f the multiplication
method 1 is given below in figure (3.12).
Method2: This method implements equation (3.7.1.2) using three additions, two
subtractions and three real multiplications [17]. In this method, the number o f real
multiplications gets reduced at the cost o f increase in the number o f additions and
subtractions. Pictorial representation o f the multiplication method2 is given below in
figure (3.13).
Figure 3.12; Complex multiplication implementation using method 1
O f the two above discussed methods o f complex multiplication implementation, method2
has an obvious advantage in terms o f number o f real multiplications. But the trade-off is
53
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
an increase in the number o f addition and subtraction operations. In general digital circuit
design terms, a m ultip lie r design and layout is on the higher side in terms o f area and
power consumption. It consumes more clock cycles for its execution when compared to
an adder or subtraction circuitry. Taking these factors into consideration, method2 of
implementing complex multiplication is better when compared to method 1. Hence our
design incorporates method2 for implementing complex multipliers.
ac-bd
Figure 3.13: Complex m ultiplication implementation using method 2
From [9], it is seen that the implementation o f a real m ultip lier involves three major steps
namely Booth encoding. Partial Product reduction and Carry propagate addition. The
purpose o f these three steps in order specified is to reduce the number o f computations on
the multiplication operations. The partial product reduction is based on Wallace tree
structure. Complete description about the three steps is mentioned in detail in [9].
Assuming each o f the three steps takes one clock cycle to execute, it takes three complete
54
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
clock cycles fo r implementing one real multiplication. This is depicted in figure (3.14).
The complex multiplication using methodZ can be implemented using 3 real
multiplications, 3 real additions and 2 real subtractions. The complete block level
implementation is shown in figure (3.15). The implementation is seen to consist o f 5
stages, w ith each stage consuming one clock cycle.
Booth Partial Product Carry PropagateEncoding Reduction Addition
Figure 3.14: Real multiplication implementation stages.
tmag part of FFT output
R eel part o f FFT ou tpu t
S ubi
CPA
CPA
C PA
EncodingM uit2
En coding _ M ulti
PartialP roduct
r e d u c t io n
P a r tia lProduct
reduction
PartialProduct
reduction
stage 1 Stage2 stages StageA S tag eS
Figure 3.15: Complex multiplication implementation using method 2
55
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
3.7.2 Group 1 Design and Implementation
In the previous section, we saw two ways by which complex multiplication could be
implemented. We also considered one o f the two methods to be more suitable fo r our
design. Having described the logic implementation methodology, we need to now see the
V H D L implementation o f the various blocks that make up data-path for group 1
butterflies described in section 3.7. Based on equations (3.7.1) and (3.7.2) we can design
our data-path. Equation (3.7.1) involves addition o f the two input data F i(k) and F 2 (k).
Since F j(k) and F 2 (k) are both complex in nature, we need to add the real and imaginary
parts o f the two data separately to satisfy equation (3.7.1). Hence we need two real adders
to implement equation (3.7.1). Equation (3.7.2) can be split into two parts w ith the first
being (F i(k}-F 2 (k)) and the second being complex m ultiplication w ith . Hence, the
second equation involves one subtraction and one complex multiplication. Due to the
complex nature o f data, we need to have two separate subtracters one each fo r real and
imaginary term. The block diagram depicting the data-path for group 1 as designed in our
architecture is shown in figure (3.16).
3.7.3 Group2 Design and Implementation
We know that the tw iddle factor that gets utilized in this group is o f value j . Based on
equations (3.7.1) and (3.7.2) we can design our data-path for group2. Equation (3.7.1) is
same as that for group 1 butterflies. Hence we can retain the same adders and subtracters
used for group 1 implementation. Equation (3.7.2) can be split into two parts w ith the first
being (F i(k)-F 2 (k)) and the second being multiplication w ith j . As discussed earlier, just
swapping the real and imaginary terms and inverting the sign between the swapped terms
can achieve m ultip ly ing w ith imaginary term j. Hence, the multipliers are disabled
56
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
a_reaLoutputa real
b realajmagiiia ry_outpul
a imao
EncodingMuM2
b_real_output
Twkidia tactor real
T *klcHe_faclor Imag
CPA
CPA
CPA
Adds
Sub1
Add1
Add 2
Sub3
EncodingMulti
EncodingMuH3
P a r t ia l...
ProductfeducUon
Partial Product j
reduction
PartialProduct
reduction
Figure 3.16: B lock level Data-path for group 1 fo r tw iddle factor based architecture
whenever butterflies o f this group get executed. O f the 32-bits o f data, the first b it
represents the sign bit and is 1 for negative numbers and 0 fo r positive numbers. Once
subtraction operation takes place, we use two registers to swap the real and the imaginary
terms. The two registers get activated only when group2 butterflies get computed. Thus
data-path for group2 utilizes the same adders and subtractors that were used for group 1
computation. It only requires two new registers to swap real and imaginary data. The
block diagram depicting the data-path for group2 as designed in our architecture is shown
in figure (3.17)
57
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
a real
boreal
a ..imaa.j 1
b jm ag
Addi
Add 2
a_r0al_ou1put
aJmaginary„oytput ►-
a realb_reaLoiJtputSubi
b real
8 im ag^i " g Sub2
b J m a g ! ______
Figure 3.17: B lock level Data-path fo r group2 fo r twiddle factor based architecture
3.7.4 Group3 Design and Implementation
We know that the tw iddle factor that gets utilized in this group is o f value 1. Based on
equations (3.7.1) and (3.7.2) we can design our data-path for group 3. Equation (3.7.1) is
same as that fo r group 1 and group2 butterflies. Hence we can retain the same adders and
subtracters used fo r group land 2 implementation. Equation (3.7.2) can be split into two
parts with the firs t being (F ](k)-F 2 (k)) and the second being multiplication w ith i . As
discussed earlier, m ultip ly ing w ith 1 can be just ignored. Hence equation (3.7.2) just
involves two real subtractions one each fo r real and imaginary term. Hence fo r group3
butterfly computations, we do not need to design any further blocks. We only disable the
multipliers used in group 1 and the two swapping registers used in group2. The block
58
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
diagram depicting the data-path for group3 as designed in our architecture is shown in
figure (3.18)
a realW Add1
b real
a_real_output
a irnaq _AddZ
b imag
a real
ajmaginary_output
b realSub1 b real output
a imag — bJmag_outpul Sub2 -------------------------------------------------------------------------------------------------------- ►
bJmag I-----------
Figure 3.18: Block level Data-path fo r group3 fo r twiddle factor based architecture
3.8 Summary
In this chapter, we summarize the various aspects o f design o f the main blocks o f the
architecture we designed based on Twiddle-Factor-Based FFT algorithm [1]. The main
blocks include the Random Access ^ e m o ry Address generator. Read Only Memory
Address generator, the 1024-words Random Access Memory (RAM ), 572-words Read
Only Memory (ROM), Address decoders. Temporary register bank and Data-path. A few
multiplexers are also designed and used in the architecture mainly used to regulate the
flow o f various signals across the blocks. Signal clarifications are done using Resolution
function blocks that are designed at necessary locations. This is done mainly because few
o f the signals have either multiple sources or destination thereby necessitating resolution
functional blocks to resolve the signal conflicts. The resolution functions subsequently
59
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
increase the complexity o f the architecture which can be seen from the operational speed
o f the design.
60
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 4
ARCHITECTURAL S IM U LA T IO N RESULTS AND DISCUSSION
4.1 Output Simulation Results
From chapter 3, we can obtain the architectural design o f the various blocks that are used
in our proposed FFT processor. A ll the blocks are designed using VH D L. In this chapter,
we get to simulate the various blocks that were previously designed. By simulation, we
obtain the output waveforms fo r the blocks for different input signals under varying
control signal environments. We shall also determine numerical values o f some important
design parameters such as clock frequency, number o f arithmetic operations and hence
number o f arithmetic operators required to meet the target time constraint and the
operational speed o f the designed processor. During this discussion we shall come across
some o f the main advantages as well as drawbacks the design has and also some
important challenges that require special attention for future designers.
4.2 R A M Address Generation B lock Parameters
This is the logic V H D L block designed to generate b it sequences that represent addresses
o f memory locations in the Random Access Memory (RAM ) from where complex data
values needed to perform DFT butterfly operations are accessed. The logic for the
address generation is based on [1] [10] and the pseudo codes for both the (log2N - l) stages
and (log2N f^ stage are given in sections (3.4.2) and (3.5.2) respectively. In blocks that
61
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
represent the (log2N - l) stages, qa and qb represent the two output ports through which the
address values are output. Both ports output 70-bit binary sequence. For the (log2N f^
stage computation, qaddr and qb form the two ports outputting address values to the
temporary registers and R A M respectively. W hile qaddr outputs 9-bit address, qb outputs a
70-bit address sequence. The difference in the number o f bits between the two ports is
mainly because o f the difference in the size o f the two different memory units. The
enable signal is used to enable or disable the block. Disabling the block (enable signal
= ’ 1’ ) retains the previous values o f the output ports. The other signals that affect the
functioning o f this block are the clear. The clear signals represent the clear function
which when made 7, erases the values at the output ports by assigning them to high
impedance value z. Once the (log2N - l) stages are executed, the blocks corresponding to
(log2N f^ stage get activated. Each o f the blocks designed get activated one after the other
w ith one block triggering the successive block. As a result, the inherent delay w ith in the
R A M address generation block for generating successive address is 2 (20nsec) to 4 (40
nsec) clock cycles (w ith 2 being the dominant delay factor) depending on which loop the
control is currently in and in which loop the next address is to be generated. The inherent
delay is mainly attributed to the non-pipelined nature o f the design. As a result, each loop
w ith in the address generation unit gets initiated by its preceding loop, which then remains
idle t il l all the succeeding loops complete their execution. As a result o f this idle nature,
we encounter more delay than in a pipelined structure wherein every block operates at
every clock cycle.
62
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
4.3 ROM Address Generation B lock Parameters
The block signifies address generation fo r accessing twiddle factors needed for DFT
butterfly operations. The output port generates the 9-bit address needed to access the
twiddle factors from the various 512 RO M locations. Once the address is output, a
triggering signal initiates the RO M block to generate the data value corresponding to the
address from the ROM address generation block. The real and the imaginary data values
are output from separate RO M address blocks. The clear represents erasing signal and
has the same functionality as described in section 4.2. The ROM block is designed to
generate address without any delay and hence it can generate address every clock cycle as
per the requirement. There is no inherent delay in this block.
4.4 Random Access Memory Parameters
The R A M as described is a 52-bit, 7024-words block. The 70-bit address generated for
data access from R A M is decoded by the ram_decoder, to find the exact location from
where data is to be read out or stored. Once the address gets decoded, data from that
corresponding location gets read out or data gets stored in depending on whether read or
write operation is specified. The read and write signal get activated automatically by the
controller design, which activates the read signal as soon as address gets generated by the
R A M address generator. W hile the read signal is high or 7, write signal remains active
low or 0 and vice-versa. There is no inherent delay in this block and hence once address
is placed for decoding, data from R A M can be accessed w ith in the same clock cycle.
63
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
4.5 Read Only Memory B lock Parameters
The ROM, which stores the twiddle factors, is a 52-bit, 572-locations block. The input to
the ROM is a 9-bit address sequence from the RO M address generator block. Once the
address is present at the input port, the data stored in that address location is loaded on to
the output port. The output is a 52-bit tw iddle factor data. As soon as the address from the
ROM memory controller gets generated, the twiddle factor value from the corresponding
location is loaded onto the output port. There is no inherent delay in this block and hence
data can be accessed w ith in one clock cycle (lOnsec).
4.6 Data-Path B lock Parameters
The data-path serves as the arithmetic back bone o f the processor under design. From
sections 3.7.2, 3.7.3 and 3.7.4, data-path seems to vary between the three sections with
some components being common among the three. The addition and subtraction
operations in equations (3.7.1) and (3.7.2) respectively are common while the varying
component is the m ultiplication w ith the tw iddle factor. Figure (3.13) clearly specifies the
various blocks that are designed to minimize the number o f m ultiplication operations.
Assuming the arithmetic blocks getting executed one after the other, one block at a
time, it takes 5 clock cycles (50 nsec) excluding the operation o f loading the data into the
data-path from memory unit and writing them back into memory unit after data-path
operation, which takes one clock cycle each to implement any DFT butterfly arithmetic
operation in group 1.
6 4
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Group 2 data path does not involve any multiplication as only swapping between real
and imaginary terms are necessary. Hence on the whole it takes 2 elock cycles (20nsec)
excluding the operation o f loading the data into the data-path from memory unit and
writing them back into memory unit after data-path operation, which takes one clock
cycle each to implement any DFT butterfly arithmetic operation in group 2.
Groups data-path is probably the simplest among the three as it only involves
multiplication w ith i , which can just be neglected. Hence other than the addition and
subtraction operations specified in equations (3.7.1) and (3.7.2), no other arithmetic
operations are required. Hence the number o f clock cycle’ s butterflies in this group take
to complete the arithmetic operations is just 1 (10 nsec) excluding the operation o f
loading the data into the data-path from memory unit and writing them back into memory
unit after data-path operation, which takes one clock cycle each to implement any DFT
butterfly arithmetic operation in group 2.
4.7 Clock Generation and Frequency Calculation
As seen in section 3.2, our architecture is designed to target two main applications
namely H D T V and OFDM transceiver. W hile H D TV FFT necessitates 8792-points in
896 microseconds, O FDM transceiver requires 7024-points execution w ith in 57
microseconds. O f the two target tim ings, OFDM transceiver’ s 57micro-seconds is o f the
shortest duration and hence we need to design our architecture to satisfy 7024-point
execution w ith in 57 micro-seconds which i f satisfied would also help achieve 8792-
points FFT execution in 896 micro-seconds. In order to achieve the tim ing target, we
65
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
need to first determine the clock operational frequency. The clock is basically used to
synchronize the various blocks that are designed and used in the architecture at any given
instant. For example, i f two distinct blocks need to get executed at the same instant, we
need to ensure that they start and end their executions exactly at the same instant and not
at d iffering time intervals. To ensure this, we need a signal that can synchronize all the
blocks o f the architecture. To ensure synchronization, we need to make sure that the
blocks gets enabled or disabled either at the rising or fa lling edge o f the clock pulse. The
clock frequency thus determines how frequently the clock signal rises or falls which
ultimately determines the number o f calculations that can be performed w ith in the
required time frame. On the other hand, we can also determine the clock frequency based
on the number o f operations i f known. In our design, we determine the clock frequency
based on our target tim ing o f 51 microseconds to perform 1024 points FFT operations.
We thus need to determine the number o f operations that are involved in 1024 points FFT
calculations. To determine the clock frequency, we do the fo llow ing calculations.
(1) The (log 2N f^ stage o f operation based on Twiddle Factor decomposition involves
no tw iddle factor multiplications. Assuming a pipelined structure wherein all the
blocks o f a given design have the ability to operate simultaneously and no block
needs to be idle and wait fo r data from its previous block. This enables us to
generate an output every clock cycle. This assumes one clock cycle period for
every butterfly operation through the data-path as described in section 3.7.4.
Total number o f butterflies involved in the (log2N)'^ stage = (N-1 ). (4.7.1)
Hence, total number o f clock cycles needed assuming one clock cycle for every
butterfly operation = (N-1 ) (4.7.2)
66
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
(2) The (logzN -lf^ stage involves butterflies w ith imaginary term ‘j ’ twiddle factor
multiplication.
Total number o f butterflies involved in the (log2N - l f ' stage = ((N /2)-l). Hence,
total number o f clock cycles needed assuming one clock cycle for every butterfly
operation = ((N/2 ) - l ) (4.7.3)
(3) The firs t (log2N-2 ) stages involve complex multiplications.
Total number o f butterflies involved in the (log2N-2 ) stages
(log, N-2)is ^ ( A / 2 ' ' ^ ) ( 2 ' - l ) (4.7.4)
i=l
Hence, total number o f clock cycles needed assuming one clock cycle fo r every
(log, N-2)butterfly operation = ^ (A /2 ^ '^ '^ ) (2 ' -1 ) (4.7.5)
f=l
Substituting N=1024 in (1), (2) and (3), the total number o f butterfly operations involved
= 5120. Hence assuming one clock cycle for every butterfly operation, the total number
o f clock cycles necessary fo r N=1024 is 5120. Since our target tim ing is 51
microseconds, to implement 5120 clock cycles w ith in 51 microseconds, each clock pulse
should be o f time width 0.02 microseconds. Hence the corresponding clock frequency
becomes i 00 Mega-hertz.
4.8 Memory Access Feature
The architecture under design is based on Twiddle-factor FFT algorithm [1] [10], which
claims to reduce the number o f memory access when compared to conventional FFT
algorithms. Since our architecture maps the tw iddle factor based FFT algorithm, it is also
67
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
expected to reduce the memory access when compared to other architectures targeting
conventional FFT algorithms.
W hile describing the decomposition procedure fo r Twiddle-Factor Based FFT
algorithm, we showed that once a tw iddle factor gets loaded from ROM on to the data
path, no other tw iddle factor is loaded until all the butterflies that use this particular
twiddle factor get computed. In other words, every tw iddle factor gets loaded only once
during the entire architecture implementation. The number o f twiddle factors that are
present in the decomposition o f an N-point FFT is (N/2). Hence based on the above
explanation, the total number o f times the tw iddle factors would be accessed from the
ROM where they are stored is (N/2).
In case o f conventional FFT algorithms namely D IF or D IT FFT algorithms, a
tw iddle factor gets loaded from ROM every time a butterfly gets accessed. The total
number o f butterfly DFT operations that get computed for an A-point input is (N/2)
log2N. Hence for any conventional FFT algorithms, the total number o f times the twiddle
factors would be accessed from the RO M where they are stored is (N/2) log2N.
In terms o f our architectural design, we can see that from loop 2 execution given in
section (3.4.1), the total number o f times loop 2 gets executed is given by
(4 8 1)
where loop l varies from 1 t i l l (log2N - l)
In addition to (4.8.1), the final (log2N)‘ stage, utilizes only W° factor.
Hence the total number o f RO M memory access in our architecture is given as
+ 7 (4.8.1)
where loop l varies from 1 t i l l (log2N - l)
68
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Hence in terms o f ROM access, our architecture seems to comply w ith the memory
access reduction claim o f [1] [10]. Thus for various values o f N, the variation in ROM
memory access o f our architecture based on [1] [10] as compared to other architectures
based on conventional FFT algorithms like D IT and D IF are in a graphical form in figure
(4.1). In addition to the memory savings obtained through Read Only Memory (ROM),
the twiddle factor based architecture can obtain additional memory savings through lesser
Random Memory Access (R AM ) accessing.
Computation o f every butterfly DFT structure necessitates 4 R A M accesses two o f
which are fo r reading out the input data from R A M and the other two for storing back the
computed result back on to the same locations in the RAM . Hence for any given A-point
input conventional FFT algorithm and hence any architecture based on it, there are
N2A log2 A R A M accesses since there are a total o f — log^ A number o f butterfly
operations each requiring 4 R A M accesses.
Comparison of Read Only Memory (ROM) memory access between Twiddle factor based
architecture and Conventional FFT algorithm based architectures
6 0 0 0 -
5 0 0 0
4 0 0 0
3 0 0 0
2000
1000
y y / / / yNumber of FFT Input data points
- R O M a c c e s s in c o n v e n t io n a l F F T a lg o r i th m b a s e d a r c h i t e c tu r e s
- R O M a c c e s s in tw id d le f a c to r b a s e d a r c h i t e c tu r e
Figure 4.1: Graphical comparison for ROM access between twiddle factor based architecture and conventional D IT or D IF based architectures
69
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
In case o f twiddle factor based architecture, the final stage o f decomposition involves
butterflies utiliz ing as the only tw iddle factor. There are (N-1 ) butterflies in the final
Nstage o f which the first (— - 1) butterflies have only 1 R A M access as all other three
Naccessing is done from the temporary registers. The last (— ) butterflies on the other
hand have only 1 input accessing from R A M while both data at the output o f the butterfly
gets stored back on to the R A M instead o f the temporary registers. This is because these
set o f butterflies form the last stage o f operation and hence the final FFT data result needs
Nto be accessed from RAM . Hence these (— ) butterflies each have 3 R A M accessing.
Hence the total R A M accessing from the last stage o f decomposition is
N 3N(— -1 ) + = (2N - 1). The butterflies in the remaining stages have 4 R A M access as
usual. Hence, total number o f R A M memory access for all stages is given as
N4((— log2 N ) - ( N - 1)) + 2N - 1 = (2N log2 N - 2N + 3). Hence, when compared to the
(2 Nlog2 N) R A M access obtained from conventional D IT or D IF based architectures,
(2A log2 A - 2 A + 3) seems to be a significant reduction o f as much as 25%. This
directly has an impact on power consumption as it can be further reduced. Though
temporary register accessing is still present, it is much faster when compared to the slow
R A M accessing m ainly due to its small sizing and its positioning closer to the data-path
than the RAM . This reduction in R A M memory access is shown pictoria lly in the graph
depicted in figure (4.2).
70
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Comparison of Random memory Access (RAM)access between Twiddle Factor based architecture and Conventional FFT algorithm
based architectures
„ 21000 I « 180005 8 15000 "8 « 12000 I s' 9000 E E 6000 i E 3000
n
Ær <cT rAr
Number of FFT Input data points
- RAM access in twiddle factor based architecture
- RAM access in conventional FFT algorithm based architectures
Figure 4.2: Graphical comparison for R A M access between tw iddle factor based architecture and conventional D IT or D IF based architectures
4.9 M u ltip lie r Operational Savings Using the Twiddle Factor Based Architecture
As seen from sections (3.7.2), (3.7.3) and (3.7.4), the entire decomposition operation o f
FFT based on Twiddle factors can be classified into three groups, two o f which namely
group2 and group3 do not involve any multiplication operations. As a result, the
multipliers designed and used in the data-path can be disabled while butterflies from
these two groups get computed. In conventional FFT algorithms, butterflies requiring no
twiddle factor multiplications are available in every stage o f decomposition along w ith
other butterflies that do involve multiplications. Hence disabling the multipliers at
different time instances becomes very complex and hence most architectures do not
disable the multipliers at any instant [22], [23]. As a result, irrespective o f whether a
butterfly involves m ultiplication operation or not, the m ultip lier used in the data-path
remains enabled thus consuming unnecessary clock cycles as well as power. In our
71
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
architecture, the butterflies involving no tw iddle factor multiplications are classified into
separate stages and hence it is easy to disable the m ultip lier across those stages.
Total number o f butterfly operations where m ultip lier remains enabled in conventional
NFFT algorithm based architectures = — log2 N (4.9.1)
Total number o f butterfly operations where m ultip lier remains enabled in Twiddle factor
based FFT architecture = ( - y lo g j N ) - [ ( - y - l ) + (N-1)]
( — log2iV - — - 2 ) 2 = 2
(4.9.2)
Equations (4.9.1) and (4.9.2) give a comparative analysis o f the savings we attain in
terms o f number o f times the multipliers get utilized effectively. A graphical
representation is shown in figure (4.3).
Comparison of Number of times multiplier gets utilized
6000<uD) 50000)o.
4000
3000
2000o«
. Q£3z
1000
Number of FFT Input data points
- Number of times multipliers get utilized in conventional FFT algorithm based architectures
-Number of times multipliers get utilized in Twiddle factor based FFT algorithm based architectures
Figure 4.3; Graphical comparison fo r m ultip lier operation usage between twiddle factor based architecture and conventional D IT or DEF based architectures
7 2
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
4.10 Design Issues and Challenges
The previous sections described the computational complexity, operational speed in
terms o f number o f clock cycles and also the inherent delays present in each o f the
architecture blocks. We observed that pipelining a structure could result in a shorter
execution time than a non-pipelined structure as pipelining involves every block
operating at every clock cycle (when one block computes on one set o f data, its preceding
block computes on the next set o f data while the succeeding block works on the previous
set o f data that was earlier fetched. That is i f block (x+1 ) works on data (y+J), then at the
same clock cycle, block x, which precedes block (x+J) works on data (y+2) while the
succeeding block |jc+2 j works on data y). The non-pipelined structure on the other hand,
involves enabling only one block at a time while the succeeding as well as the preceding
blocks remains idle t i l l the current block completes its execution.
In case o f our architectural design, we have managed to pipeline every block except
fo r the R A M Address generation block. The main challenge involved in pipelining this
block is the presence o f nested loop structure (a loop w ith in another loop). The presence
o f nested loops in the address generation logic design makes pipelining a non-pragmatic
and extremely challenging task. Since nested loops involve transfer o f control back from
the outermost loop to the inner most and then all the way back to the outer loop, keeping
track o f the current values o f registers and variables in each loop becomes tedious.
Consequently, we were not able to pipeline R A M Address generation block while all
other blocks could be pipelined. Non-pipelining o f this block resulted in its inherent delay
being apparent while FFT butterflies get computed. From section 4.2, we know that the
delay caused by R A M Address generation block is 2 to 4 clock cycles o f which 2 is the
73
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
most dominant delay factor. We also saw that all the other blocks designed and
implemented in our architecture do not involve any delay is the most dominant delay
factor. We also saw that all the other blocks designed and implemented in our
architecture do not involve any delay. Hence, we encounter 2-clock pulse delay fo r every
butterfly computed.
NTotal number o f butterflies for an W point FFT = -^ lo g ^ N (4.10.1)
Hence for N = 1024, total number o f butterflies computed = 5120 (4.10.2)
From section 4.7, the clock frequency utilized in our architecture is found to be 100 M Hz.
Hence the tim ing w idth o f each pulse obtained by taking the inverse value o f the clock
frequency = 10 nanoseconds. (4.10.3)
Hence assuming 2-clock cycle in calculating every butterfly, it takes 10240 clock cycles
to calculate all the 5120 butterflies. Since each clock pulse is o f 10 nanoseconds width, it
takes a total o f 102400 nanoseconds or 102.40 microseconds to compute 1024 input FFT
using twiddle factor-based algorithm. Since the R A M Address generation block cannot
generate address every clock cycle, the data-path that basically requires data from those
corresponding address locations to compute the butterflies w ill not be able to do so. Since
the R A M Address generation block could not be pipelined, the data-path follows suit
though it is simple to pipeline this block. Hence in addition to the 2-clock cycle we
encounter due to R A M Address generation block delay, we also have data-path delay
included.
The data-path delay as discussed earlier is determined based on whether complex
multiplication is involved or not. I f complex multiplication is involved, then based on
figure (3.16), we have 5 stages o f computation and hence the delay is 5 clock cycles. In
74
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
case o f no complex multiplication involvement, then the delay is only 2 clock cycles. In
addition to these delays, there are also delays due to resolution functions that add up to
additional 2 clock cycles fo r every butterfly computation. The total clock cycles required
computing a butterfly involving no complex m ultiplication is given as 6 clock cycles that
includes the R A M address generation delay, data-path delay and additional delays due to
resolution functions.
The total number o f butterflies for N = 1024 which involve complex multiplication is
1534.
Hence, the total number o f clock cycles required computing 3586 butterflies
= 1534 * 6=9204 (4.10.4)
Similarly, the total clock cycles required computing a butterfly involving complex
multiplication is given as 9 clock cycles that includes the R A M address generation delay,
data-path delay and additional delays due to resolution functions.
The total number o f butterflies for N = 1024 which involve complex multiplication is
3586.
Hence, the total number o f clock cycles required computing 3586 butterflies
= 3586*9=32274 (4.10.5)
Combining equations (4.10.4) and (4.10.5), we can determine the total number o f clock
cycles involved in computing 1024 input FFT based on tw iddle factor architecture is
4 I47&
W ith a 10 nanoseconds clock pulse width, the total time taken to execute 1024 points
input FFT based on tw iddle factor architecture is calculated to be 414.78 microseconds.
75
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
4.11 Architectural Salient Features and Drawbacks
The biggest advantage the tw iddle factor based FFT architecture has over other
conventional D IT or D IF based architectures is in terms o f memory access which
ultimately results in power savings.
(1) Our architectural design has 10 times lesser Read Only Memory (ROM) access
when compared to D IT or D IF based architectures.
(2) The Random Access Memory (R AM ) access is reduced by as much as 25% in our
design as compared to other previous architectures.
(3) Our design has 1.45 times lesser number o f operations taking into account the
total number o f memory accesses and multiplication computations.
(4) Clock gating where unused blocks are disabled in order to reduce unwanted
power consumption while the blocks remain idle is utilized in our design w ith the
help o f resolution blocks that enable and disable different blocks.
(5) As a result o f the reduction in the number o f operations performed, fo r a 1024
point FFT based on tw iddle factor design, power reduction up to 31.25% in terms
o f memory access and arithmetic blocks usage is expected.
The main drawback that this design suffers is in the time duration taken to compute 1024-
point operations. This is mainly attributed to the non-pipelined nature o f our design.
When compared to our in itia l target application o f OFDM transceiver which necessitates
a 7024-point FFT to get computed w ith in 51 microseconds. Hence we can observe that
our architectural design despite its advantages in terms o f computational complexity and
power savings, is 8 times more time consuming in computing 7024-point FFT as required
by the standardized O FDM transceiver.
76
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 5
CONCLUSIONS AN D SUGGESTIONS FOR FUTURE W ORK
5.1 Summary o f W ork
Our entire work focuses on one-to-one mapping o f the Twiddle factor based FFT
algorithm described in [1] [10] on to hardware blocks so as to extract the maximum
benefits derived from the algorithm. Based on the algorithm, various logic blocks were
designed and simulated using VH D L. Memory savings in terms o f memory access have
been mapped on to the blocks from the algorithm. The architecture makes minimum
utilization o f arithmetic hardware blocks, especially the multipliers. Though this
architecture is advantageous in terms o f power savings when compared to other previous
architectures, the major drawback it faces is in terms o f the time it takes to complete its
required execution. This is attributed to the non-pipelined nature o f its design. Thus
7024-points FFT can be computed using the Twiddle factor based FFT architecture in
414.78 microseconds w ith up to 7.45 times lesser number o f operations resulting in
power savings o f up to 31.25% is expected. A ll necessary blocks have been designed and
simulated in Active-H D L 6.3.
5.2 Suggestions for the Future
As seen in Chapter 4, the main factor that delimits the efficient usage o f the twiddle
factor based architecture is the execution time. Consequently, we have explained that
77
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
pipelining the R A M Address generation block, which automatically enables pipelining o f
other blocks like the data-path is the most important way o f reducing the tim ing problem.
An efficient way o f pipelining nested loop structures need to be developed in order to
reduce the large execution time fo r processors.
It is a well-known fact that pipelining make architecture attain their fu ll efficiency and
help achieve a higher throughput though w ith some in itia l latency. One other important
method to reduce the delay o f operation is to utilize high-speed hardware blocks,
especially the arithmetic blocks like the adders and multipliers. The type o f complex
m ultip lie r used for our data-path is shown in figure (3.13). It can be seen that this type o f
complex m ultip lie r design uses more addition operations than real multiplications. Hence
usage o f fast adders becomes a necessity in-order to maximize efficiency and increase
throughput. From [9], it can be inferred that fo r arithmetic adders involving b it widths o f
24 or higher. Carry Look Ahead (C LA ) adders are probably the fastest when compared to
Ripple Carry Adders (RCA) or Carry Save Adders (CSA) and many others. Hence, it is
highly recommended to design C LA adders while seeking improvement in the current
design. For implementing real multipliers. Array Based multipliers that are w idely and
most commonly used over various logic designs can be utilized.
Power analysis to estimate the power efficiency o f the tw iddle factor based
architecture w ith all the above said improvements implemented should be carried out
using power estimation CAD tools. This would give an accurate picture o f the
effectiveness o f the tw iddle factor based algorithm [1] [10] over conventional D IF or D IT
algorithms.
78
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Further savings in terms o f memory access, which is the main highlight o f our design,
can be effectively obtained by using smart memories like Cache and flash memories.
These memory units, unlike the Random Access Memory are small in size and are
comparatively much more efficient in terms o f accessing time. As these memories are
smaller in size and are placed closer to the processor than the Ram, which consequently
enables them to be much quicker than conventional large-scale memories. These
memories help in reducing frequent R A M or ROM access and in turn increase the
effective power saving capacity o f our design.
79
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPENDIX
V H D L DESIGN OF VARIOUS ARCH ITECTU RAL BLOCKS’
The V H D L design and implementation o f various Twiddle factor based FFT architectural
blocks described in Chapters 3 and 4 are pictoria lly represented in this Appendix.
A
K *
5 f '
# . ; e %
TTT*§•
Figure A l l : V H D L block representation o f (log2N- l ) stages R A M Address generation block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa and qb
80
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
(Ik elk ena" ena
U1
alier_tw o_w riteJn_tenipjegi-after_eveiY_writeJn_temp_reg^
mx_signal_oul3Jn_from_datapalhS—U2
log_N_stage_signal
N blockl_it*a(ion_of_block2
ck Wock1_sub_bk)ck1_inMw
ck no _o f_s t^s
Wue_Wn_logt_block1_out
log N stage # a l inI Isigal_alter_at_block1Jn
r
new^leg n bleckf
■W-
N Uock2_inifeitioti_of_sub_yoci(1
Wock1_in*aNon_tmWkÊomptMMn_block2 \^e_from _block2
ek
ena
^b_btack1Jnili3ting_block2
new_logn_block2
afl'îr two write in RAM
Ü7'
bk)ck1J#ak)n_in no_of_con^utafai5
eft s ^ W from blocki to blocki
cit sit_block2_itftiËion
ena
no_of_stages_in
signttf_aftef_sub_block2
-newJogn_blockFsub_blQck1
U3
N qa*lr1(5J)
after_eveiy_mte_m_temp_reg c^jogn l
0W_W_M@e_m_temp_reg < _res<Mon_1 elk signal_ffom.sii)_block2_to_sub_b!ock1
cir
ena
mx_ _ai3_n_ ram_datap8ftno_of_cwrptatk)os_in
no_of_stages>
sii)_block2^lnftiaKwi_B5
i_lrom_loyi_Uock1_in
- Ü 5 - NEW LOGN BIO CKI sA t i look:
N qa_bgi2
8ftef_lwo_wrie^ln,RAM qa<kk2ff:0)
block2_mWw_b_sub_bloek1_in qb_logi2
elk qb_fesolitfon_2
ck signd_fro(n_ai)__Wock1_to_blak2
tiMjigTÿ_oüt3_m_lrom_(btap!ih no_of_compttetion_block2_Bi
vakje_frwn_Uock2_in
value_frofn_loÿi_block1 _in
lqa_logn2
new
qbjogn
qaddr(5:0)
U6
qa(ldr(5:0)
qb_bgn
qaddfl_in(5:0)
qaddr2_kt{5:0)
qb_logn1_in
qb_logn2>
(ÿ)_re»lutoi_1_in
qb_fesolufcn_2_ii
-new-
bk)ck2_ïilfetion_fojiè_UiKR _temp_i •' ck Write_temp_fe9 '
ck
ena
mx_signal_oiA3_in_from_d@hyalh
sii)_b!ock2_iràiation_in
value.from_logn_block1Jn
READ m i E RESOLUTION FOR TEMP REG ney,
NEW_LOGN_BlOCK2_cub_blocl( i :(C)ALDEC.Inc ; 2260 Corporate Circle Henderson, NV 89074
RESOUmON_FOR_q
-#R eadJem p_reg-#W riteJem p_reg
Created;
Title:
6/16/2005
No Title
The Design venficafion Company
Figure A -12: V H D L block representation o f the (log2N)'^ stage R A M Address generation block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa_logn2 and q b jo g n and 6-bit Temporary register bank address qaddr
81
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
clrf?<- cIr clkB elk
readJjm agE>-,read_1Jm ag w rile J J m a g E - w r i le jjm a g U1
mad_1_imag*—I
address from deeoder#-dalajn_lmaginary{31:0)®-
write_1Jmagi
elk dia_out_1Jni8g(31:0)
dr
cs_1_rajg
dala_m_1_mag(31fl)
reBd_1_m g
<— ©BusOutputO(31:0)
ram_celljmag_1
Figure A-I3: V H D L design o f a 7-word R A M cell fo r storing imaginary part o f data.
c IrL cir elk! elk
re a d _ 1 _re a lC - re a d _ 1 je a l write 1 r e a l# - , write 1 real U1
read 1 re a l© —
address_from_decoderl data in
write 1
' elk daia_out_1(31:0)p
■ cir
' cs_1
- cWaJn_1(31;0)
•• r e a d j
•write 1
-D data_output(31:0)
ram cell real 1
Figure A-I4: V H D L design o f a 7-word R A M cell for storing real part o f data
82
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
cIrZclkP'
read_1_real- write 1 reaL
cirelkread_1_rea! write 1 real
de ena 2x4 real
U5^
elk de_oul_2x4_real(3:0) ■
cir
de_ena_2x4_real
de Jnp_2x4_real(1:0)
decoder 2x4 real
BUS1549(0)
BUS 1549(1)
BUS 1549(2)BUS1549(2)
de_lnp_2x4_real(1:0)
Figure A-I5: V H D L design o f a 2x4 Decoder block choosing one among four outputs based on a 2-bit input. The input port is given as de jnp_2x4_rea l(l:0 ) and output ports are represented by de_out_2x4_real(3:0)
U12 rom__designJmag_new
rom_imga_out_real(31:0)
U9
— #^BusOutput1 (31:0)U10
mem_oonli^0^_controlier1 — >
rom_design real new
elk mem_contr_out(5:0)
cir rom jnitiatlon_signal
mem_contr_ena
rom_sig_in
elk
mem_eontr_ena
rom_address(5;0)
rom initiation
signal_from_block3Jnput
rom_real_out(3:0)
elk
mem contr ena
rom_address(5:0)
rom initiation
signal_from_bloek3Jnput
romJmag_out(3:0)
Figure A-I6: V H D L design o f a 64-words ROM and a decoder along w ith interconnects outputting 32-bits real and imaginary tw iddle factors
83
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
re a d J J m a g ^ V ea d _1 _ im W e _ 1 _ im a g # f W te_1Jnr
BUS11
read_1_itnaglwrite_1Jmagl
d a la _ in jjm a g (3 1 :0 )# —
de_inp_2x4Jmag_2(1:0)l
de_ena_2x4_imag
de_inp_2x4
le_inp_2x4Jmag_1 (1:0)© —
U4_ _ _ _ _ _ _ _ _ _
elk de_oul_2x4_iiiiig(3:0)
dr
de_ena_2x4_lmag
de_inp_2x4_mag(1:0)
-4«US1t7Ç|
IIUS117(2)
BJS117(3)
decoder_2x4Jmag
cfc dahi_CKjl_1_imag(31:0)
cir
data_in_1_ima01:O}
de_ena_2K4_imag
de_inp_2%4_im@g(1i])
de_inp_2x4^imag_1(1fl)
r«8d_1_imag
w te 1 imag
\16 words RAMJmag2\ — #data_oul_1_imag(31:0)
cl data_oj!_1_imag(31;0) ck
de_ena_2K4_fnagds_inp_2x4_ima9(1Æ)
de_inp_2)i4_iïiag_1(1 l)
read_1Jmag
wie.1_»nag
\16 words RAM imag2\ U3
e l d*i_oüt^1_imag(31;0)
ctfdata_in_1_imag(31:0)
de_ena_2%4_mg
de_mp_2x4_mag(1Æ)
de_inp_2»4_BTiag_l(1:0)
read_1Jmag
\16 words RAM imag2\ U5
elk dat!_d td j_iirag(31:0)^
de_@na_2x4_imag
(k_inp_2x4_im@g(1:0)
de_Hip_2x4_imag_1(1fl)
rflad_1Jmag
In}flte Circle ,NV 89074 The Design Verification Company
11/4/2004
No Title
\16 words RAMJmag2\
Figure A-I7: V H D L design o f 64-words R A M using four 76-words R A M cells and a decoder along w ith interconnects outputting 32-bits real and imaginary data
8 4
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
a_realjnp(31:0)
b_realjnp(31:0)
adtH ena
jmagLinp(31:0)
fiag_inp(31;0)
add2 ena
sub1 ena
multi ena
B—I sub2 ena^
mull2_ena
B—adds ena
c_reai(31:0)® -
Isub3 ena ^
U1
a_addripi:0) add1_oi^31:0)
a d d ijn a nK_dM3path_1
b jd d r lp fO l
elk
elf
U2 adder1_ne,v_
a_aM2(3l:0) ailil2.i»jl(3l:0)
atkj2_ena
aiM2(31:[lt
elk
cir
U3adder2 new
i_subr1(31:0) sub1_Qut(31:0|
iju b rip i.O )
elk
elf
subi eng
U4Subtfactor1_new
i_subi2pi:0) siA2_out(31:0)
)jmm)elk
cir
sub2 ena
s # a c to f2 new
multS ena sub4 ena add4 ena adds aiaa_real_o(Jt(31:0)
multiplier1_nw
a_!Tuft1pl;0) multl_out(31:Q)
eï(
cir
multi ena
a_mtA2p1:Q) rnjlt2_out(3!:0)
_n#2pi:0) elk
cir
mult2 ena
buf3_sub4pi.O) sub4_out(31;0)
bul4_stA4pl:0)
elk
cir
sub4 ena
subtractff'l m -
U7 r2 new
a_ad(Jf3pi:0) add3_out|31;0)
add3_ena
)_acfdr3pi:D)
elk
cir
U8 adder3_ne¥.'
:_subi3pi;0) sub3_out(31:0]
c_sutff3pl:D)
elk
cir
subS ena
Jtle:
a_mult3(3!:0) nul3_oiiÿ31:D)
)_mult3pl:0)
elk
elf
multS ena
U11ajmaa_out(31:0)
ad(M_ena add4_(M(3l:0)
buf3_add4pi:0)
buf4_ald4pl:0)
adder4 nen
bjmag_outjroup1(31:0)
bjmag_oiyroup2(31:0)
b real4315-
addSjna add5_ou^l:0)
buf5_add5pi:0| rn x ja la p a th j
elk
cir
stA4_add5(3!;0)
adderS_newJb_real_out_group1(31%
b I
bjinag_oiyroup3(31:0)
(()ALDEC.Inc 2 !60 Corporate Circle H ïnderson, NV 89074 | Qgjg \j f Company
eated: 11 /1 /2 0 0 4
No Title
clki dk
diL di
LDECÜ14-
dk m x jig n a lju t i
cir m<_S!gnal_out2
iTO_datapatli_IJn m<_s!gnal_otiÛ
rn(_dalap#_2jn
inx_si(
—Kmx sii
Cl i h t r o r t r r l nm.'inx_datapatli_resolution_nev/
Figure A-I8: V H D L design block showing the data-paths for group 1, group2 and groupS. The output signals fo r each group are represented by b_real_out_groupX and b_imaginary_out_groupX w ith X representing individual group numbers.
8 5
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
add2_enalaiM1_ena@U2
cadeim_enaB— ^
fi(Jnp1(31:0)B------ -ix_inp2(31:0)&------ *-
tlk d»m_ouH(31jl|
dr dmx.oul2(31C)
dsrrax.cs deim_oii3(31:0)
deniitr.ena dm .oU4(31:0)
der«iJX_M(3W) demux_inp2(31J)
demultiplexerja_ql)
add3_enal---------------I enal
cjmag(31;0)^c_real(31:0)i>-
addS eti!#-
mull1_enal muKend mull3_enal sub1_enal sub2_enal sub3_enai aib4 end
U1
Ia.imag.it()(3lil) a_irrtag_rrirl|3M)
a je d jp p U ) a jea l.ori(3 IJ)
addt_ena b,inag.mrl(31:0)
add2_erra b.real.ota(3l:0)
add3_sna mxjgtral_oiiM
addd.eria mx_signal_(jrjl2
add5_ena n t* j^ _ o rd 3
b_irra9.B|)(31:0)
b_real_irrp(3W)
tJmag(31J)
c_raal|3t:ll|
dk
dr
mriltl_arta
i!Ë2_«ria
mi#3_etra
snbt_e«a
sub2.ena
5tÈ3_m
snb4_atta
U3
ck dala_miix_(ijt1(31d))
dr data.inrK.oifi(31J)
(telrr_mi»_ts
data_mn(_ens
date_mtK>1(3li))
dala_mrrx_inp2(3lll)
dah_miri(_inp3(3H))
dala_intix_in|)4(3l:0)
rmdtiplexefjWapat
■#dala_niux_out1(31:0)■#data_mux_ou12(31:0)
: ena
lmx_signal_out1
-#mx_sigral_oul3
Oal bcompWe
Figure A-I9: V H D L design block showing the resolution function designed fo r the datapath to choose between the three data-path groups and also for determining the Read and Write operations o f the R A M block and other resolution blocks.
86
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
d#Ctlr
ig afe datajntojam US
N a_tiiljM_tes#ig_m_m1
c l b_outJoi_resoMng_in_m1
ck qa(5:0)
era qb(5:0j
sigjfler.tla laJnto jan i tom.sig
signal_froniJock3J
signaljmm_blocl(3J
mm_sig#-'
qb(5dl)@-ijknim1ni2_resoiiition_qa_int1ostdlogjcJognstage
# -
a_MiljM_reséirig_m_m1jesttïiiiu*_6ra
c l cs_irax_erajeed
clics_nnix_enajee(tiack
ena
m_datapalk_(mt
lead jS erJom jam
vnJe alec tom lam
Cl imix enaU3
prog_contr_resolytion_niüx_os_ena
U4
" a_oUJor_tesolviiig_m_m1 je c é e muxjna
■■ b_oiit_foijesolmg_in_tii1_t6solï6
- c l
-cil
a_mdJotjesotiing_m_mteaàeWaeJe_iam
e- c l wle_ïalue_lo_ram
■*-cli
*-ena
mx.datapait.ciil
+—#fead_valueja_ t—*wrile vAe lo
ram read writd
-— Ireadjfterjromjam
"write after from ram
ena
Figure A-IIO : V H D L design block showing R A M Address generation block with resolution functions that aid in enabling and disabling various signals at different instances o f time
87
Reproduced witfi permission of tfie copyrigfit owner. Furtfier reproduction profiibited witfiout permission.
d#tdr
U1
m ux_clr#f)m ux_clrIt. mem contr-ena-
mem contr ena '
n1m2 eso lu liqnja jiiltostd logicjop i li c
N 8_oul_foi_resokwg_m_m1
ck b_oifl_foM6s<A'mg_m_ml
cir qa(5:0)
ena # : 0 )
yg_after_data_into_ram rom_sig
siÿwl_fiom_block3_1
m-ck rrw(_out(5;0)
iwx_clr
mux_cs_fromj€soMion
mux_8na_ftomjesolutiofi
m ux>p1(5;0j
0:1
m u ltip le x erja jb
datajn_1(31:0)
datajn_1_imag(31:0)
a_outJofjesofkm a.m _m 1jesotem ® _ena
ck cs_mux_ena_f@ed
CËes_mux_ena_fe«fcack
m>;_dat^al}i_out
read_aftef^from_ram
write after ftom ram
prog_contrjesolution_mux_cs_ena
,A(::3),A(4;5
tw-a_wl_for_resolmg_m_m1_reaoke m u x jn a
b_oU_fot_ies(^ng_m_ni1 resotve
c_out_after_two_reads_iam
elk
cir
mx_datapath_out
■ prog_contH@ekrti6it mux enable
tm-
signal_tojnpdl_datajnto_ram_initi%
c_CHjt_f6rjesovlrîg
read_after_for_tesdving
dala_in_1{31;0) sig_aftei_dala_from_fan
dala_in_1_imagf31:0>wie_aft6r_forj6sotving
rmrx_orftput{5:0}
read_1
signaljo_lnput_data_hto_ramjratialf/
mem contr e n a # —
U9
(C)AÎÎB1
U3
a_crtrt_for_resolvittg_m_m!ea*9Wae_toj»ii
ck writejà ie J o _ r m i
ck
ena
mx_datapalh_out
prog_contrjesclution_ramjead_ttrite
^mux_datapath_out
-W-
c k data_outj(31:0;
ckd a ta jn j(3 l:0 )
de_ena_2x4_real
de_bp_2x4.real(1;0)
de_inp_2x4_ieai_1{1:0)
de_kip_2x4_reaL2(1:0)
Ï64 WORDS RAM2i
I.DECCorp
ttendersOT
-W-s$'
Af.j).
’■ Î
c k data_out_1_imBg(31Æ)
ck
data_ii_1Jmag(31:0}
de_ena_2x4_imag
dejnp_2x4_ima9{1:0)
de_lnp_2x4_ima9_1(1:0)
de_B^)_2x4_imag_2i1f))
read_1_imag
wrile_1_imag
data_out_1_lma( (31:0)
de ena 2)
(31:0)
4_imagWORDS RAMJmag2\
de ena 2x4 i
U12
Inc jra te Oifcl , , N V 8 #
elk m6m_conlr_outi5:0}
cli iom_inftiation_signal
mem contr ena
.rom_sig_in
20
ck rom_imag_out{3:0)
mem_contr_Kia
iom_address(5Æ)
rom_kiSinsi?id_from _block3_ii^
U10
ificatio 1 C
i t i e
rom_imag_out(3:0) ' i)ul(3:0)
"—Belk rom_real_out(3:0)
mem_conti_ena
rom_adèessi;5Æ)
romjnitiatton
signalJom_bkick3_inpift
rem-desigiyeal_new
rom_design_imag_new
inemory_conlroller1
Figure A - I l l : V H D L design block showing R A M Address generation block, the R A M blocks, the ROM memory eontroller block, the ROM blocks and all the other interconnects
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
A *C«naU1
nnK_dr#t,fnuL*It 'ff ie m corjir ena
mem contr ena
IJkmm 1m2_resolLttioi Lqajnttostdlogicjognstage
ü ttr
3dd1_ena data_mux_Mi1(31:0)
add2_ena data_mii<_(wt2(31:0)
add3_ena TO_signaI_ou{l
adcM_ena fre_signaUut3
addSjna
c.imagplOJ
c_iB3lp1:0)
elk
cir
dat3_rrux_«ra
demi(_cs
derrwx_ena
dema_inplp1:0)
detnix_inp2p1:0)
mult!_ena
rrult2_ena
multSjna
siÈI_ena
si4i2_ena
si*3_ena
s # ena
(tatapathjejo lutionjonip let
N a_out_for_tesoMn9_m_m1
elk b_out_for_iesolving_m_iTi5
cir qa(5;0)
qb(5:D)
fter_data_lnto_i3m rom_sg
signat_Wm_bbck3_!
dk )TiKjut(5:0)
mux_cs_fit)m_resolution
t7ux_enaJrom_resoton
mux_mpl(5:0)
mK_ïç2(5:0) A 5:1
multiplexer_qa_qt:'
mx_signal_out3
Wr3_oiflJor_resolving_m_ml_reso}y8:s_mux_ena
elk cs_mLK_8na_feed
cir
es mux ena feedback
mx_dat^atti_out
read_^er_from_ram
wnte_^er_ffom_ram
rtcfcowtrj olutionjririi'te ^
,A(2:3).A(4:5)
-W-8_ait_for_RSolmg_m_ml_resQlve m jx jn a
b_oul.for_(ESokmg_m_ml_fesrtve
c_out_after_twD_î5ads_ram
elk
cir
mt_d3fapath_out
-profj ot-resolulionj
« -
C—
signalJoJnpptJataJntojamjnitially
c_oul_fcr_resoving
read_after_fof_resolving
i_in_ipi:0) sig_afler_dafajrüfîi_r3m
i_in_f Jmag(3l;0) wte_afterJof_iesoMng
nux_oi5iuq5:0) read, I
s^nalJoJnput_d3ta_into_r3m_initially
lUX
)EC. Inc brporate dirclersqnpNVetW
d: 5/17/21
|U3^a_oulJof_resoMng_m_mt_ie8bJyslue_to_r3rn
ck wite_value_îo_ram
cir
ena
nt<_dat^atb_out
progL,contjesolution_ram_rpad_wri' a
4 3 6 -
A((
elk data_oia_ipi:0)
cir
data_in_1(31:0)
de_ena_2x4_real
de_irç_2x4_real(l;0)
de_inp_2x4_real_l(!:0)
de_inp_2x4_ieal_2(l:0)
,5
Mite_l
\6f| WORDS RAMSU
W-
Üiili i i i lA O I)
elk d3ta_out_tj
cir
data_in_1_imaB(31:
de_en3_2x4jrT0g
np_2x4_knag(1
np_2x4_imag_'
dejnp_2x4_irnag_;
lead.l.lmag
w!te_l_m g
\64 WORDS R
U12
l ie Design Verjl
mem contr i
elk mem_ccmtr.ouq5:G)
cir romJnidation_signal
mem_contr_ena
rom_sig_in
elk ramjmag_ou^3;0)
mem_corW_ena
fDm_3ddress(5:0)
rom^nitiafew
;ignal_fmm_block3_iivutU10
rom jesigi
memory_controller1
lac(Dm_real_out(30)
mem contr enanw ■iôm_address(5:0)
tcm_inibaüon
signal_from_bk)ck3_inpul
_ jd f lc in n r a a l n a u i
Figure A-I12: V H D L design block partia lly showing interconnection between all the blocks o f the Twiddle factor based FFT architecture.
89
Reproduced witfi permission of tfie copyrigfit owner. Furtfier reproduction profiibited witfiout permission.
BIBLIO G RAPHY
[1] Yingtao Jiang, T ing Zhou, Yiyan Tang and Yuke Wang , “ Twiddle-Factor-Based FFT
A lgorithm w ith Reduced Memory Access” , In the Proceedings o f the International
Paralle l and D istributed Processing Symposium, IEEE, 2002.
[2] Bevan M.Baas, “ An Approaeh to Low-Power, High-Performance, Fast Fourier
Transform Processor Design” , Ph.D. dissertation, Stanford University, Stanford, CA,
February 1999.
[3] Smith, Steven W ., “ The Scientist and Engineer's Guide to D ig ita l Signal Processing” ,
2nd edition. San Diego; California Technieal Publishing, 1999. ISBN 0-9660176-3-3.
[4] Alan V.Oppenheim and Ronald W.Ssehafer, “ D igita l Signal Proeessing” , Oetober
2000. ISBN-81-203-0532-9
[5] Brown, J. W. and Churchill, R. V., “ Fourier Series and Boundary Value Problems” ,
5th ed. New York: M eG raw -H ill, 1993.
[6] Brigham E.G., “ The Fast Fourier Transform and Its Applications” , Prentice-Hall,
Englewood C liffs, NJ, 1988.
[7] Julius O.Smith III, “ Mathematics o f the Discrete Fourier Transform (DFT), w ith
Music and Audio Applications” , W 3K Publishing, 2003. ISBN 0-9745607-0-7.
[8] John Proakis, D im itris Manolakis, “ D ig ita l Signal Processing - Prineiples, Algorithms
and Applications” , Pearson, ISBN 0133942899.
90
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
[9] Jan M.Rabaey, Anantha Chandrakasan and Borivoje N iko lic, “ D igita l Integrated
C ircuits-A Design Perspective” , Prentice Hall, Second Edition, 2003. ISBN 0-13-
090996-3.
[10] Yiyan Tang, Yingtao Jiang and Yuke Wang, “ Reduce FFT Memory reference for
low power applications” . In IEEE International Conference on Acoustics, Speech and
Signal Processing, 2002, Volume 3, 13-17 M ay 2002 Pages(s).TII-3204- III-3207 vol.3.
[11] Byerly, W. E., “ An Elementary Treatise on Fourier's Series, and Spherical,
Cylindrical, and Ellipsoidal Harmonics, w ith Applications to Problems in Mathematical
Physics” . New York: Dover, 1959.
[12] Carslaw, H. S., “ Introduction to the Theory o f Fourier's Series and Integrals” , 3rd
ed., rev. and enl. New York: Dover, 1950,
[13] Davis, H. F., “ Fourier Series and Orthogonal Functions” , New York: Dover, 1963.
[14] Dym, H. and McKean, H. P. “ Fourier Series and Integrals” , New York: Academic
Press, 1972.
[15] Folland, G. B., “ Fourier Analysis and Its Applications” , Pacific Grove, CA:
Brooks/Cole, 1992.
[16] Groemer, H., “ Geometric Applications o f Fourier Series and Spherical Harmonics” ,
New York: Cambridge University Press, 1996.
[17] Oppenheim A.V., Schafer R.W., and Buck J R., “ Discrete-Time Signal Processing” ,
Prentice-Hall, 1999.
[18] Smith S.W., “ The Scientist and Engineer's Guide to D ig ita l Signal
Processing” , (http://www.dspguide.com/pdfbook.htm), California Technical Publishing,
San Diego, 2nd edition, 1999.
91
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
[19] Retrieved from “ http://en.wikipedia.org/wiki/Discrete_Fourier_transform” .
[20] Sevan M . Baas, "A Low-Power, High-Performance, 1024-point FFT Processor."
IEEE Journal o f Solid-State Circuits (JSSC), pp. 380-387, March 1999.
[21] Sevan M . Baas, "A 9.5mW, 330/tsec, 1024-point FFT Processor," Proceedings o f
the 1998 Custom Integrated Circuits Conference (CICC), Santa Clara, CA, USA, 11-14
M ay 1998.
[22] Wen-Chang Yeh, and Chein-Wei Jen, “ High-Speed and Low-Power Split-Radix
FFT” . In IEEE International Conference on Signal Processing, Vol., 51, No.3, March
2003.
[23] He .S. and Torkelson, “ Designing pipeline FFT processor fo r OFDM
(de)Modulation,” in Proc. IEEE URSI International Symposium o f Signals, System and
Electron, 1998, pp.257-262.
92
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
VITA
Graduate College University o f Nevada, Las Vegas
Bhaarath Kumar
Home Address:1600, E. Rochelle Ave., Apt 42 Las Vegas, NV-89119.
Degrees:Bachelor o f Engineering, Electronics and Communication Engineering, 2002, University o f Madras, India
Thesis Title:Design and Implementation o f a Fast Fourier Transform Architecture using Twiddle Factor Based Decomposition A lgorithm .
Thesis Examination Committee:Chairperson, Dr. Yingtao Jiang, Ph. D.Committee Member, Dr. Emma Regentova, Ph. D.Committee Member, Dr. Eugene McGaugh, Ph. D.Graduate College Representative, Dr. A jit K.Roy, Ph. D.
93
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.