Design and implementation of a fast Fourier transform

UNLV Retrospective Theses & Dissertations

1-1-2005

Design and implementation of a fast Fourier transform Design and implementation of a fast Fourier transform

architecture using twiddle factor based decomposition algorithm architecture using twiddle factor based decomposition algorithm

Bhaarath Kumar University of Nevada, Las Vegas

Follow this and additional works at: https://digitalscholarship.unlv.edu/rtds

Repository Citation Repository Citation Kumar, Bhaarath, "Design and implementation of a fast Fourier transform architecture using twiddle factor based decomposition algorithm" (2005). UNLV Retrospective Theses & Dissertations. 1837. http://dx.doi.org/10.25669/uqf7-sxec

This Thesis is protected by copyright and/or related rights. It has been brought to you by Digital Scholarship@UNLV with permission from the rights-holder(s). You are free to use this Thesis in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Thesis has been accepted for inclusion in UNLV Retrospective Theses & Dissertations by an authorized administrator of Digital Scholarship@UNLV. For more information, please contact [email protected].

http://library.unlv.edu/

http://library.unlv.edu/

https://digitalscholarship.unlv.edu/rtds

https://digitalscholarship.unlv.edu/rtds?utm_source=digitalscholarship.unlv.edu%2Frtds%2F1837&utm_medium=PDF&utm_campaign=PDFCoverPages

http://dx.doi.org/10.25669/uqf7-sxec

mailto:[email protected]

DESIGN AN D IM P LEM EN TATIO N OF A FAST FOURIER TRANSFORM

ARCHITECTURE USING T W ID D LE FACTOR BASED

DECOMPOSITION ALG O R ITH M

by

Bhaarath Kumar

Bachelor o f Engineering University o f Madras, India

2002

A thesis submitted in partial fu lfillm ent o f the requirement for the

Master of Science Degree in Electrical Engineering Department of Electrical and Computer Engineering

Howard R. Hughes College of Engineering

Graduate College University of Nevada, Las Vegas

August 2005

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 1429711

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy

submitted. Broken or indistinct print, colored or poor quality illustrations and

photographs, print bleed-through, substandard margins, and improper

alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript

and there are missing pages, these will be noted. Also, if unauthorized

copyright material had to be removed, a note will indicate the deletion.

UMIUMI Microform 1429711

Copyright 2006 by ProQuest Information and Learning Company.

All rights reserved. This microform edition is protected against

unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company 300 North Zeeb Road

P.O. Box 1346 Ann Arbor, Ml 48106-1346

Reproduced witfi permission of tfie copyrigfit owner. Furtfier reproduction profiibited witfiout permission.

m m i.Thesis ApprovalThe Graduate College University of Nevada, Las Vegas

July 12 .2005

The Thesis prepared by

Bhaarath Kumar

Entitled

"Design and Implementation of a Fast Fourier Transform Architecture

_________ Using Twiddle Factor Based Decomposition Algorithm"__________

is approved in partial fulfillment of the requirements for the degree of

Exam

-V.Examination CommiMee Member

G radiM e College Faculty Representative

Master of Science in Electrical Engineering

Examitmtion Cmnmittee Chair

Dean o f the Graduate College

11


ABSTRACT

Design and Implementation of a Fast Fourier Transform Architecture using Twiddle Factor Based decomposition Algorithm

by

Bhaarath Kumar

Dr. Yingtao Jiang, Examination Committee Chair Assistant Professor

Department o f Electrical & Computer Engineering University o f Nevada, Las Vegas

W ith the advent o f signal processing and wireless communication mobile platform

devices, the necessity for data transformation from one form to another becomes an

unavoidable aspect. One such mathematical tool that is w idely used for transforming time

and frequency domain signals is Eourier Transform. Fast Fourier Transform (EFT) is

perhaps the fastest way to achieve transformation. Many algorithms and architectures

have been designed over the years in an attempt to make EFT algorithms more efficient

and to target many applications.

The main objective o f our work is to design, simulate and implement an architecture

based on the Twiddle-Eactor-B ased decomposition EFT algorithm. The significant

feature o f the algorithm is its effective memory access reduction that accounts to be as

much as 30% lesser than in any other conventional EFT algorithms. As a result o f this

memory reduction, this algorithm is said to be more power efficient and is said to

compute in much lesser number o f clock cycles than other algorithms developed.

Ill


LIST OF FIGURES

Figure L I Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components andusing N twiddle factors........................................................................................9

Figure 1.2 Data Flow graph o f an 8-point DFT calculated by splitting the N pointinput into two N/2 parts containing even and odd components andusing only N/2 tw iddle factors..........................................................................10

Figure 1.3 Data Flow graph o f an 8-point FFT, showing the entire decompositionuntil 2-point DFT stage is reached. The stage w ith is the laststage o f decomposition w ith all butterflies being 2-point DFTs.................. 11

Figure 2.1 A radix-2 Decimation-In-Time (D IT) butterfly structure.............................. 18Figure 2.2 Flow graph o f an 8-point D IT FFT structure using butterflies......................18Figure 2.3 A radix-2 Decimation-In-Frequency (D IF) butterfly structure.................... 20Figure 2.4 Flow graph o f an 8-point D IF FFT structure using butterflies......................20Figure 2.5 Flow graph o f a 16-point FFT structure based on Twiddle-Factor-

Based algorithm ................................................................................................. 23Figure 3.1 Flow graph o f a 16-point FFT structure based on Twiddle-Factor-

Based a lgorithm ................................................................................................. 27Figure 3.2 Pseudo code for the (log2N - l) stages decomposition..................................... 34Figure 3.3 B lock design representation o f (logzN -l) stages based on Twiddle

Factor decomposition a lgorithm ......................................................................36Figure 3.4 (log2N)‘ stage decomposition structure for A = 16........................................ 39Figure 3.5 Pseudo code fo r the (log2N - l) stages decomposition.....................................40Figure 3.6 V H D L Block implementation o f (log2N f^ stage based on Twiddle

factor decomposition algorithm .......................................................................42Figure 3.7 B lock implementation o f (log2N - l) stages and (log2N) stage based

on Twiddle factor decomposition algorithm.................................................. 43Figure 3.8 B lock level representation o f I-w ord R A M cell fo r storing 52-bit data... 45Figure 3.9 B lock representation o f a R A M decoder..........................................................46Figure 3.10a Basic group formation o f R A M cells using decoders.................................... 47Figure 3.10b Hierarchical group formation o f R A M cells using decoders........................ 47Figure 3.11 B lock representation o f Read Only Memory (ROM) and its controller.... 50Figure 3.12 Complex multiplication implementation using method I .............................53Figure 3.13 Complex m ultiplication implementation using method 2..............................54Figure 3.14 Real m ultiplication implementation stages......................................................55Figure 3.15 Complex multiplication implementation using method 2 .............................. 55Figure 3.16 Block level Data-path fo r group I fo r twiddle factor based architecture... 57Figure 3.17 B lock level Data-path fo r group2 for twiddle factor based architecture... 58Figure 3.18 B lock level Data-path fo r group3 for twiddle factor based architecture....59


The real focus o f the design is to build architecture to map this efficient algorithm on

to hardware retaining the maximum efficiency o f the algorithm. The complete design,

simulation and testing is done using A ctive -H D L tool which is a V H D L package

designed. The architecture designed is found to retain the memory savings capability o f

the algorithm thus enabling power efficiency.

IV


Figure 4.1 Graphical comparison fo r ROM access between twiddle factor basedarchitecture and conventional D IT or D IF based architectures.................69

Figure 4.2 Graphical comparison fo r R A M access between twiddle factor basedarchitecture and conventional D IT or D IF based architectures.................71

Figure 4.3 Graphical comparison fo r m u ltip lie r operation usage between twiddleFactor based architecture and conventional D IT or D IF basedarchitectures......................................................................................................... 72

Figure A - I I V H D L block representation o f (log 2N - l) stages R A M Addressblock based on Twiddle factor decomposition algorithm generating10-bit R A M address sequences qa and qb.......................................................80

Figure A -I2 V H D L block representation o f (log 2N )‘ stage R A M AddressGeneration block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa_logn2and qb_logn and 6-bit Temporary register bank address qaddr....................81

Figure A-I3 V H D L design o f a 7-word R A M cell fo r storing imaginary parto f data.................................................................................................................... 82

Figure A -I4 V H D L design o f a 7-word R A M cell fo r storing real part o f data............... 82Figure A-I5 V H D L design o f a 2x4 Decoder block choosing one among four

outputs based on a 2-bit input. The input port is given as de_inp_2x4_real(l :0) and output ports are represented bye_out_2x4_real( 3 :0)............................................................................................ 83

Figure A - I6 V H D L design o f a 64-words RO M and a decoder along w ithinterconnects outputting 52-bits real and imaginary twiddle factors..............83

Figure A -I7 V H D L design o f a 64-words R A M using four 76-words R A M cellsAnd a decoder along w ith interconnects outputting 52-bits realand imaginary data................................................................................................84

Figure A - I8 V H D L design block showing the data-path s for group I, group2 andgroup3. The output signals for each group are represented by b_real_put_groupX and b_imaginary_out_groupX w ith Xrepresenting individual group numbers............................................................ 85

Figure A -I9 V H D L design block showing the resolution function designedfor the data-path to choose between the three data-path groups and also for determining the Read and Write operations o f the R A Mblock and other resolution blocks...................................................................... 86

Figure A-110 V H D L design block showing R A M Address generation block w ith resolution functions that aid in enabling and disablingvarious signals at different instances o f tim e................................................... 87

Figure A -H I V H D L design block showing R A M Address generation block, the R A M blocks, the ROM memory controller block, the RO M blocksand all the other interconnects............................................................................88

Figure A -H 2 V H D L design block partially showing interconnection betweenall the blocks o f the Twiddle factor based FFT architecture..........................89

VI


ACKNOW LEDGEMENTS

I take this opportunity to thank my advisor Dr. Yingtao Jiang, for his guidance and

support.

I am grateful to professors in my thesis committee for all their valuable suggestions. I

would like to thank M r. Stan Hanel fo r providing access to A ctive -H D L whenever I

needed it most.

I wish to thank all my friends and fam ily members.

vn


TABLE OE CONTENTS

ABSTR AC T........................................................ i i i

L IST OF FIG URES...................................................................................................................... v

AC KN O W LEDG EM ENTS.......................................................................................................v ii

CHAPTER 1 IN TR O D U C TIO N .......................................................................................... II I Thesis Objective................................................................................................................ 11.2 Fourier Analysis................................................................................................................2

1.2.1 The Fourier Series....................................... 21.2.2 Fourier Transform.................................................................................................. 41.2.3 Discrete Fourier Transform................................................................................... 51.2.4 Fast Fourier Transform ..........................................................................................7

1.2.4.1 Mathematical Calculation for FFT Computational Com plexity 81.2.5 Power Aware D esign........................................................................................... 12

1.3 Organization o f the Thesis W rite-up............................................................................13

CHAPTER 2 FAST FOURIER ALG O RITHM S AN D ARCH ITECTU RE.................152.1 Introduction..................................................................................................................... 152.2 Decimation-In-Time (D IT) FFT A lgorithm ................................................................172.3 Decimation-In-Frequency (DIF) FFT A lg o rith m ......................................................182.4 Decimation Based on Tw iddle Factors.......................................................................20

2.4.1 Decimation Procedure .......................................................................22

CHAPTER 3 ARCH ITECTU RAL DESIGN A N D IM P LE M E N TA TIO N .................253.1 Overview..........................................................................................................................253.2 Design Target.................................................................................................................. 263.3 A lgorithm Setup..............................................................................................................263.4 Random Access Memory (R AM ) Address Generation Block Design...................28

3.4.1 Formulae o f Computational Complexity for the Loop Structures for (log2N - l ) Stages o f Decomposition...................................................................30

3.4.2 Pseudo Code Implementation.............................................................................333.4.3 Architectural B lock Design and Implementation............................................ 34

3.5 Formulae o f Computational Complexity fo r the Loop Structures for(log2N)‘ Stage o f Decomposition................................................................................373.5.1 Memory Reduction Technique.........................................................................38

vin


3.5.2 Pseudo Code Implementation............................................................................. 403.5.3 Architectural B lock Design and Im plem entation............................................ 40

3.6 Random Access Memory Design................................................................................. 443.6.1 Read Only Memory (ROM) design................................................................... 48

3.7 Data-Path Design............................................................................................................503.7.1 Complex M u ltip lie r Implementation................................................................. 523.7.2 Group I Design and Implementation.................... 563.7.3 Group2 Design and Implementation.................................................................. 563.7.4 Group3 Design and Implementation.................................................................. 58

3.8 Summary.......................................................................................................................... 59

CHAPTER 4 AR C H ITEC TU R AL S IM U LA TIO N RESULTS AN D DISCUSSION 614.1 Output Simulation Results............................................................................................ 614.2 R A M Address Generation Block Parameters............................................................ 614.3 ROM Address Generation Block Parameters............................................................ 634.4 Random Access Memory Parameters..........................................................................634.5 Read Only Memory B lock Parameters....................................................................... 644.6 Data-Path B lock Parameters.........................................................................................644.7 Clock Generation and Frequency Calculations.......................................................... 654.8 Memory Access Feature............................................................................................... 674.9 M u ltip lie r Operational Savings Using Tw iddle Factor Based Architecture 714.10 Design Issues and Challenges.................................................................................... 734.11 Architectural Salient Features and Drawbacks........................................................76

CHAPTER 5 CONCLUSIONS AN D SUGGESTIONS FOR FUTURE W O R K 775.1 Summary o f W ork.......................................................................................................... 775.2 Suggestions fo r the Future............................................................................................77

APPENDIX V H D L Design o f Various Architectural B locks......................................80B IB L IO G R A P H Y....................................................................................................................... 90V IT A ............................................................................................................................................. 93

IX


CHAPTER 1

INTRO DUCTIO N

l . I Thesis Objective

W ith increasing demand fo r mobile computing devices, conversion o f data between the

time and frequency domain has become vital. The Fast Fourier Transform (FFT) is one o f

the w idely used digital signal processing algorithms for this purpose. Numerous Fast

Fourier Transform algorithms have been developed over the years. Architectures, which

help realize these algorithms, have found applications in diverse areas as:

communications, signal processing, instrumentation, biomedical engineering. Sonics and

acoustics to name a few. The goal o f our work is the architectural design and

implementation o f a Fast Fourier Transform (FFT) processor, mapping an algorithm

whose decomposition is uniquely based on Tw iddle factors unlike the conventional

Decimation-In-Time (D IT ) or Decimation-In-Frequency (DIF). W ith numerous

architectures already in existence, one main aspect that differentiates this work from

other architectures is the algorithm that is used for the mapping purpose. The Twiddle

Factor based decomposition algorithm ensures a reduction in memory access by as much

as 30% [ I ] [10]. Hence, an architecture u tiliz ing this algorithm is expected to have lower

power consumption than any other conventional algorithm based architecture. The

memory reduction comes in the form o f twiddle factor access from the Read Only


Memory (ROM), where it is generally stored and retrieved. The architecture is designed,

simulated and tested in Very large-scale integrated circuits Hardware Description

Language (V H D L) environment.

1.2 The Fourier Analysis

1.2.1 The Fourier Series

Consider a sequence x(n) that is periodic w ith a period N so that x{n) = x(n + kN) for

any integer value o f k. Such a sequence cannot be represented by its z-transform, since

there is no value o f z for which the z-transform w ill converge. In such situations, this

sequence can be expressed using the Fourier Series (FS) tool [4] [13] [14] [15] [16]. Any

periodic signal can be expressed as a sum o f sinusoidal and co sinusoidal oscillations.

This decomposition is termed as Fourier Series (FS). By reversing this procedure, a

periodic signal can be generated by superimposing sinusoidal and co-sinusoidal waves.

Fourier series make use o f the orthogonal relationships o f the sine and cosine functions.

The Fourier series is extremely useful in breaking down an arbitrary periodic function

into a set o f simple terms that can be plugged in, solved individually and thep recombined

to obtain the solution o f the original problem. The general function is as follows:

/(■ ) = Oq + ^ ( a „ c o s ^ ^ + 6„ s in -^^) (1.2.1.1)n=l L L

where ao, a„, K are the Fourier magnitude coefficients o f the corresponding sinusoidal

and co-sinusoidal waves and 2L represents the fundamental frequency given by 2 ;r /T

rad/sec. The Fourier Coefficients can be determined from the fo llow ing integrals:

1 âo = — j f ( x )d x ( 1.2.1.2)

2L ,


1 r flTDC7i = — J / ( x ) c o s dx M = 1,2,... (1.2.1.3)

-L

M = 1,2,... (1.2.1.4)- L

For non-periodic functions, one can argue that they are periodic w ith an in fin ite period,

which is oo. The Fourier series then becomes Fourier Integral given as follows:

f ( x )= ^[a{w)co^m+b{a)?,mcox]dco (1.2.1.5)0

where,

a{co) = — ^ f { x ) cos caxdx (I.2 .I.6 )

1b{co )- — ^ f{x )sm o }xd x (1.2.1.7)

71

Thus, Fourier series are made up o f sinusoids, all o f which have frequencies that are

integer multiples o f some fundamental frequency. A great thing about using Fourier

series on periodic function is that the first few terms often are a pretty good

approximation to the whole function, not just the region around a special point. Fourier

series are used extensively in many major engineering applications, especially for image

processing and signal processing applications. They are also used in solving ordinary and

partial differential equations (heat conduction, wave theory) and also for various kinds o f

spectroscopy. Finding the coefficients o f a Fourier series is sim ilar to performing the

spectral analysis o f that function.


1.2.2 Fourier Transform

The Fourier Transform (FT) is basically generalization o f the Fourier series. The Fourier

Transform provides the means o f transforming a continuous time signal into its

corresponding frequency domain. Instead o f sinusoidal and co-sinusoidal terms used in

Fourier series, Fourier Transform uses exponentials and complex numbers. The Fourier

transform X (f) o f a continuous time function x(t) can be expressed as follows:

% ( / ) = j ( 1.2.2.1)

In general, X(f) and x(t) are complex valued functions, 7 being imaginary unity number,

defined as the square root o f - I and 2 z r / being the angular frequency range associated

with the signal. From the defin ition o f the Fourier integral, not every function x(t) has a

transform X(f) [2] [20] [21]. W hile the exact conditions for convergence o f functions are

not known, two conditions that are w idely considered sufficient for convergence

(Bracewell, 1986) are:

Condition I:

The integral o f | / (x ) | from to coexists [2] [20]. That is.

j|/(x )|d !x< o o ( 1.2.2.2)

Condition 2:

Any discontinuities m f(x ) are bounded [2].

Some o f the properties associated w ith Fourier Transform are Linearity, Scaling, Time

shifting. Frequency shifting. Symmetry, Modulation, Differentiation in time and

Convolution. The inverse Fourier Transform that converts frequency domain signal into


time domain performs the exact opposite functionality of Fourier Transform and has the

same complexity as the earlier. The Inverse Fourier Transform is defined as

/ ( 0 = — (1.2.2.3)2n

The Fourier Transform is used widely for image analysis, image filtering, and image

reconstruction and image compression.

1.2.3 Discrete Fourier Transform

The Fourier Transform described in the previous section can be compared to an analog

tool as it mainly deals w ith continuous signals and this is evident from the integral used

in equation (1.2.2.2). For the special case in which the sequence to be represented is o f

fin ite duration, it is possible to develop an alternative Fourier representation, referred to

as Discrete Fourier Transform (DFT). DFT is thus the Fourier representation o f finite-

length sequence which is itself a sequence rather than a continuous function, and it

corresponds to samples equally spaced in frequency o f the Fourier Transform o f the

signal [4]. Thus the Discrete Fourier Transform is used in the case where both the time

and frequency variables are discrete. In short. Discrete Fourier Transform (DFT) also

known as Finite Fourier Transform is w idely used to analyze the frequencies contained in

a sampled signal, solve partial differential equations, and to perform other operations

such as convolutions [6] [17] [18] [19]. The two important reasons for using Discrete

Fourier Transform over Fourier Transform are:

• The input and the output o f the DFT are both discrete values making it

convenient for computer manipulations.

• There is an algorithm called Fast Fourier Transform, which is a speedy way o f

computing the Discrete Fourier Transform.


The Discrete Fourier Transform is defined as:

X{ k ) = y jc(«) , 0 < t < A - l (1.2.3.1)n=0

equation ( 1.2 .3.1) can be rewritten as

N - \

X { k ) = Y .^ {n )W ^ , 0 < k < N - \ (1.2.3.2)n=0

where

(1.2.3.3)

with N being the number o f time samples or number o f frequency samples, x(n) being

input signal amplitude at time n and termed as Twiddle Factor. Thus, calculus is not

needed to define the DFT or its inverse, and w ith fin ite summation lim its, we do not

encounter d ifficu lties w ith infinities. Moreover, in the fie ld o f D ig ita l Signal Processing,

signals and spectra are processed only in sampled form, so that the DFT is what we really

need fo r computational purpose [7]. In simple terms, DFT is computationally less

intensive to compute than the Fourier Transform as can be seen below in this section. A t

the same time, the basic concepts are the same [7]. From equation (1.2.3.1) that for each

value o f k, direct computation o f X(k) involves N complex multiplications (4N real

multiplications) and N -I complex additions (4N-2 real additions) [8]. Consequently, it

2 7takes N complex multiplications and N -N complex additions to compute all N values o f

the DFT [8]. Therefore, roughly 2N^ or O(N^) are required to calculate the DFT o f length-

N sequence [2]. The Inverse Discrete Fourier Transform (IDFT) performs the opposite

functionality that o f the DFT and involves the same complexity and the same number o f

computations as the later. The Inverse Discrete Fourier Transform is defined as:


x{n) = — Ÿ , X { k ) W - ' ' \ 0 < k < N - X (1.2.3.4)^ n=0

Thus the Discrete Fourier Transform replaced the Fourier Transform resulting in

computationally capable algorithm, which has found itself applicable in wide range of

digital signal processing and image processing fields. Another aspect is the size o f the

memory required for an A-point DFT calculation. Since, each input term in equation

( 1.2 .3.2) needs to be preserved until the last output term has been computed, the

m inimum memory locations required is 2A [2].

1.2.4 Fast Fourier Transform

Direct computation o f DFT as seen from the previous section, consumes O(N^)

computations for an A-point operation. Though this method o f computation results in the

correct output, the efficiency o f this method when compared to the one to be discussed in

this section is very less. The main reason fo r the inefficiency o f the DFT algorithm is

because it does not explore the Symmetry and Periodicity properties o f the Twiddle

factor . The two properties are defined as:

Symmetry property: = -W ^ (1.2.4.1)

Periodicity property: =W ^ (1.2.4.2)

The Fast Fourier Transform algorithms, utilizes the above two properties thereby

reducing the total number o f computations from O(N^) to 0 (N log2 N) for an A-point

DFT. Due to huge difference in the computational complexity between direct DFT and

that o f FFT algorithm calculations, FFT has rapidly replaced DFT as the pragmatic tool

currently being used in every area o f science and engineering that requires

transformation.


1.2.4.1 Mathematical Calculation for FFT Computational Complexity

This section introduces Fast Fourier Transform algorithm’s computational advantage

over direct Discrete Fourier Transform calculation. The equations (1.2.3.2) and (1.2.3.3)

are reintroduced for derivation purpose.

N - \

X( k ) = Y ,x in )W lf , 0 < k < N - l (1.2.4.1.1)M=0

(1.2.4.1.2)

For derivation purpose, the value o f N is chosen to be even. Hence N can be represented

in the form N =2 ‘ , where ‘a ’ is a positive integer. Since N is chosen to be even, the entire

sequence o f x(n) can be split into two sequences each o f length N/2 w ith one o f the

sequence containing even components o f jc while the second sequence containing the odd

components. Thus equation (1.2.4.1.1) can be split into two summations each o f length

N/2 and can be rewritten as follows:

N - 2 N - l

% (t) = ^:c(n)W;* + (1.2.4.1.3)odd=\

From [2], i f 2a represents the even components and (2 a + l) the odd components w ith

a = 0 ,1 ,2 ... N/2-1 then equation (1.2.4.1.3) can be written as

% (t)= ^z(2a)W ^'^+ ^%(2a + l)W "'+')* (1.2.4.1.4)(7 = 0

X(^) = '';^x(2a)(W^)'^ +'';^z(2a + l)(W^)'^(W;) (1.2.4.1.5)fl=0 0=0

But W]j can be proved to be equal to . Hence, equation (1.2.4.1.5) becomes

%(t) = + (W ) T L D (^)G ^;% ) (1.2.4.1.6)n—0 0=0


with k=0, 1, 2..., N -l. We have stated that the DFT o f a sequence is periodic in its length,

which means that the (A ^ j-p o in t DFTs o f (a) and need to be calculated

for only N/2 o f the N values o f k. For each value o f k, the number o f addition operations

taking place in each o f the two summations is N/2. Flence total number o f addition

operations is 2(N/2) =N. In addition to the N addition operations, there are N operations

o f multiplications by (W^ ) . Hence fo r N values o f k, the total number o f operations is N^

additions and N^ multiplications for a total o f 2N^ operations or can be just represented as

O(N^) as the index value 2 has comparatively lesser value. This represents the

computational complexity when direct DFT method is employed. For large values o f N,

direct DFT computation becomes tedious and not practical to be implemented by digital

systems. For N = 8, equation (1.2.4.1.6) can be diagrammatically represented using

butterfly structures which depict the computation o f X(k) at the output from N values o f

x(n )â t the input.

x(0)

x(2)

x(4)

x(6)

x(1)

x(3)

x(5)

x{7)

N/2-pointDFT

N/2-pointDFT y/=\t :

X(0)

X (1)

X(2)

X(3)

X(4)

X(5)

X(6)

X(7)

Figure 1.1: Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components. The dots represent points wherein addition operations take place. The integers next to the large arrow-marks represent multiplications that take place due to ) [2]


Using the property o f periodicity, the FFT algorithms calculate DFT o f an iV-point input

by splitting it into even and odd components just as direct DFT is computed, just that

instead o f calculating k value from 0 t i l l N - l, they only calculate k value from 0 until

N/2-1 and the calculated values are re-used for k=N/2, N/2+1, , N - l. Hence, the total

number o f operations get reduced from O(N^) fo r direct DFT computation to 0(N^/2)

resulting in 50% lesser computation. Additional computational reduction can be brought

using the Tw iddle factor ) . Using the property o f Symmetry, W,k+NI2N -VFk, . Hence,

instead o f calculating (W^J) fo r k values from 0 until N - l, it is enough i f we calculate for

values o f k from 0 t i l l N/2-1 and the remaining values o f k w ill be symmetric to the

previous values. As a result, fo r an A-point input, only N/2 values o f tw iddle factors have

to be calculated. Thus figure 1.1 can be redrawn as

x(0)

x(2)

x{4)

x(6)

x(1)x(3)

x(5)

x(7)

N/2-pointDFT \ \ y V * \

N/2-pointDFT

w l *

X(0)

X(1)

X(2)

X(3)

X(4)

X(5)

X(6)

X(7)

Figure 1.2: Data Flow graph o f an 8-point DFT calculated by splitting the N point input into two N/2 parts containing even and odd components and using only N/2 twiddle factors. The dots represent points wherein addition operations take place. When compared to the figure 1.1, only four, that is only N/2 number o f tw iddle factors are calculated while the remaining N/2 are just represented as sign due to symmetry property.

10


The in itia l process was to divide A-point DFT into two A/2 point DFTs o f even and odd

components. The same process can be repeated thereby breaking down A/2 points further

down to two N/4 points o f even and odd DFTs and the process o f division can be

repeated until, it results in 2-point DFTs, which signals the end o f the splitting process as

calculating a 2-point D FT is very simple. The number o f stages o f such decomposition

for an A-point input is logg A . The FFT decomposition is shown diagrammatically as:

x(0) e

x (2 )e

x(1) • x(3)&

x(5)&

x (7 )e

•X (0 )

•X (7 )

Figure 1.3: Data Flow graph o f an 8-point FFT, showing the entire decomposition until 2- point DFT stage is reached. The stage w ith W \ is the last stage o f decomposition w ith all butterflies being 2-point DFTs.

Each o f the A/2 2-point FFT’ s requires one addition, one subtraction and there are A/2

twiddle factor multiplications per stage. One the whole, every stage requires 0(N )

operations per stage. Since there are logz A stages on the whole, the total number o f

operations for an A-point FFT becomes 0 {N \o g j A ). Thus, when compared to direct

DFT computation that has a operational complexity o f O(N^), FFT algorithms only have

0(Nlog2N) which results in 50% and higher savings in computation making FFT one o f

the fastest and computationally most efficient form o f transformation algorithms

11


developed w ith the significance clearly fe lt in applications requiring large values o f N

computation. FFT algorithms have virtually overthrown computation through direct DFT

method and have found usage in a wide spectrum o f fields ranging from sonar/radar

detection, cellular communication, digital signal processing applications, image

processing, designing High Defin ition Television (HDTV), medical imaging to name a

few. Thus FFT algorithms have become an integral part o f many scientific and

engineering applications and hence there is an ever-increasing necessity to design more

efficient FFT algorithms as well as architectures that map these efficient algorithms onto

realizable hardware components.

1.2.5 Power Aware Design

W ith the semiconductor industry well into the deep sub-micron era, lim itations on

physical dimensions have led to unprecedented challenges in terms o f behavioral aspects

o f devices. Greater challenges in terms o f power dissipation, which posed triv ia l

challenges in the earlier stages, are now posing immense constraints on devices being

designed. Heat dissipation from such devices has become a fie ld o f research by itself,

w ith the problem seeming to worsen as the device sizes shrink. The advent o f mobile

computing and communication devices has taken system design to a new high. In

addition to designing and improving circuits for increased computational ability w ith an

increased processing capability, higher operational clock frequency, higher throughput

and increased packaging density, hardware circuits in deep sub-micron era need to be

designed for lower power consumption to improve the life o f battery on which these

mobile devices solely depend, thereby reducing constraints on heat dissipation causing

circuit breakdowns. Power reduction techniques are implemented at every level o f design

12


abstraction starting from device modeling all the way until circuit design at the gate and

transistor level and layout. Special techniques have also been developed at the fabrication

level. W ith so much importance for design and implementation o f power aware

architectures and circuits, we take up the task o f designing and simulating architecture

that maps a unique FFT algorithm mainly aimed at memory access reduction, which

could ultimately lead to lesser power consumption at the architectural level.

1.3 Organization o f the Thesis Write-up

The entire report is organized into five chapters w ith the first chapter giving a basic in-

depth into the necessity fo r this work and few o f the technical startups needed to

understand the underlying concepts behind our work.

Chapter 2 gives a comprehensive description about various Fourier algorithms and

architectures based on these algorithms w ith specific focus on Fast Fourier Transform

algorithms. Detailed explanation along w ith supporting mathematical equations has been

provided for clarity purpose. Chapter 2 also gives a comprehensive coverage o f the

Twiddle Factor Based FFT algorithm, which forms the background o f our architectural

design. A clear insight on how the decimation is performed is also given.

Chapter 3 explains in detail the architectural technicalities involved in our design. The

various logic blocks designed and simulated are shown pictoria lly and a good insight on

the computational complexities o f various blocks along w ith equations has been

demonstrated. Memory reduction techniques have also been sighted to enhance the

effectiveness o f our work.

13


Chapter 4 deals w ith simulation output waveforms o f various important blocks that

have been designed and explained in chapter 3. In addition, various plots that depict the

advantage our architecture and the algorithm it is based on possess over other

architectures are plotted. The chapter also summarizes in mathematical units the effective

power savings and memory access reduction achieved.

Chapter 5 concludes our work w ith a brie f summary o f our results and some

suggestions for future designers depicting the scope for improvements that can be

extended to our work to enhance its effectiveness.

14


CHAPTER 2

FAST FOURIER TRANSFORM ALGO RITHM S AN D ARCHITECTURES

2.1 Introduction

As seen in chapter 1, the Fast Fourier Transform is computationally the most efficient

way o f computing the Fourier transformation. From Section 1.2.3, the calculation o f

direct form o f DFT requires O(N^) and from Section 1.2.4.1, the calculation o f DFT

through FFT algorithm requires only 0(Nlog2N) computations for an A-point input. For

small values o f A, this difference is not significant. But for large values o f A, the FFT is

orders o f magnitude more efficient than direct DFT calculation. Table 2.1 below gives a

comparative analysis o f the computational advantage o f FFT over direct DFT calculation

for various values o f A. A number o f FFT algorithm variants have been designed over the

years each having inherent computational advantages and disadvantages. FFT algorithms

on the broader spectrum are divided into two main types namely Decimation-In-Time

(D IT) and Decimation-In-Frequency (DIF). Either o f the types involves splitting the input

data points into odd and even components, perform DFT operations and then recombine

the calculated values to form the output transformed data points. The D IT form o f FFT

algorithm is formed by splitting x(n) which represents the time domain, into even and odd

components o f N /r data sequence and then continuing the decomposition or splitting

15


operation t i l l r-point DFT sequence is reached, where r is specified in terms o f N as

N = r ' ' , w ith k being a positive integer.

Table2.1: Comparative tabulation o f DFT and FFT computational efficiency

Transform length

m

DFT operations FFT operations DFT operations

-Î-FFT operations

16 256 64 4

128 16,400 896 18

1024 1.05 X 10^ 10,240 102

1,048,576 1.1 X 10^^ 2.1 X 10^ 52,429

The total number o f decomposition stages is given by logrN, w ith the logrN stage

having all its computation as r-point DFT operations. Hence i f r=2, then decomposition

of x(n) would proceed until the stage wherein all DFTs operations or so called butterfly

structure have just 2-points each for computation form ing a rad ix-r (w ith r —2 in this

case) D IT-FFT algorithm. On the other hand, the D IF form o f FFT algorithms are formed

by splitting X(k) which represents the frequency domain, into even and odd components

o f N /r data sequence as in case o f D IT explained above. Hence, this type o f algorithm is

termed as rad ix-r D IF FFT algorithm. On a comparative basis, D IT and D IF are

computationally same, thus enabling using either o f the two forms o f algorithms.

16


2.2 Decimation-In-Time (D IT ) FFT Algorithm

As we saw in the previous section, Decimation-In-Time FFT algorithm fo r N = / , is

derived by splitting A-point input sequence into N /r equal sequences o f even and odd

components o f the input data. For example, i f r=2, then A-point input data sequence

given by x(n) is divided into two N/2 sequences one containing even and the other

containing odd components o f x(n). The equation for output X(k) is given by equation

(1.2.4.1.6) which is restated below for convenience.

X ( t ) = (a )G V ^,) + ( W ^ ) ^ L o ( « W ; : ^ 2 ) (2.2.1)fl=0 a=0

Equation (2.2.1) can be generalized as

%(t) = f / t ) + w;;, FzW (2.2.2)

Splitting X(k) into two A/2 components results in the fo llow ing equations

FzW, t =0, 7, 2.. . , AC2-7 (2.2.3)

%(t+A/2) = F;(t) - F2(t), t =0, 7,2. . . , 1W2-7 (2.2.4)

Equation (2.2.4) has a negative sign as compared to (2.2.3) because o f the fact

thatW^^'^'^ = . A butterfly is a structure that diagrammatically represents equations

(2.2.3) and (2.2.4). Using butterflies to draw flow graphs simplifies the diagrams and

makes them easier to read.

17


« X=A+BW

Y=A-BW

Figure2.1: A radix-2 Decimation-In-Time (D IT) butterfly structure

x(0

x(4

x(2

x(6

x(1

x(5

x(3

x(7

3 ><

z xZ X

• X(4)

Figure2.2: Flow graph o f an 8-point D IT FFT structure using butterflies

2.3 Decimation-In-Frequency (DIF) FFT Algorithm

As in the case o f D IT , Decimation-In-Frequency is obtained by splitting the input

sequence into N /r sequences. I f r=2 (say), then the A-point input x(n) is split into two

sequence each o f N/2 points. Unlike in D IT wherein input data is split into even and odd

terms, the D IF just involves splitting the input sequence into N /r sequence. To derive the

algorithm, we begin by splitting the DFT formula into two summations (since r=2), one

o f which involves the sum over the first N/2 data points and the second sum involves the

last N/2 data points [8]. Thus from [4] [5] [ I I ] [12] we can write

18


or

(N/2-1) (V-1)%(Æ)= (2.3.1)

n=0 n=N!2

(A //2 -1 ) (A f/2 -1 ) ^

% ( t ) = E % (»+— )iy ;* (2.3.2)M=0 M=0 ^

Combining the two summations in equation (2.3.2) and using the fact that W,(N/2)kN

= (-1) , we obtain

( A ' / 2-1) AT

% W = %][X») + (-l)*Jc(» + — (2.3.3)n=0 2

Considering the even and odd components o f k and representing them w ith X(2a) and

X(2a+1) respectively, so that

(V /2 -1 ) AT

%(2a)= 2][x(M) + XM + — (2.3.4)n=0 2

{N/2-D AT

% (2 a + i)=M=0 2

a = 0, 7..., (A^-7) (2.3.5)

Thus D IF equations can be generalized for r=2 as

X(k) = F](k) + F 2(k) (2.3.6)

N

19


# X=A+B

Y=(A-B)WB e

Figure2.3: A radix-2 Decimation-In-Frequency (DIF) butterfly structure

X(0)

* & - • X(1)

-1 -1

Figure2.4; Flow graph o f an 8-point D IF FFT structure using butterflies

2.4 Decimation Based on Twiddle Factors

In microprocessor-based system, memory access is expensive mainly due to larger

latency and higher power consumption [ I ] [10]. From figure 2.4, it can be seen that the

various twiddle factors used for the DFT operations depicted by the butterfly structure are

repeated at different stages and even w ith in each stage. For example, the twiddle

factor gets used in stage I as well as stage2. Even w ith in stage2, is used twice.

20


Hence in this 8-point D IF FFT, is seen to be used for a total o f three times. In terms

o f computation, it involves accessing memory thrice to bring in the same twiddle factor

to perform DFT computation. Thus, the redundancy in twiddle factor memory access is

obvious. For larger values o f N, the redundant memory access becomes substantial

leading to higher power consumption. A unique Twiddle-Factor-Based FFT algorithm

[ I ] , is designed to reduce the frequency o f memory access as well as multiplication

operations. The algorithm is mainly divided into two sections based on the Twiddle

factors that are present. The firs t section named as Super Stage (SS), computes the

butterflies involving tw iddle factors ( /# 0) through a computation scheme sim ilar to

Hoffman coding [ I ] . In this section, all butterflies that use the same twiddle factor are

clustered together and computed, thereby having to load that twiddle factor only once to

compute all the butterflies that use it, instead o f accessing the same twiddle factor each

time a butterfly that uses it needs to be computed, resulting in substantial memory access

reduction. In the second section, (N-1 ) butterflies involving the tw iddle factor are

computed using a top-down tree structure. Simulations proved a 20% reduction in clock

cycles and an average o f 30% reduction in memory access for a 32-point FFT using

Twiddle-Factor-Based FFT algorithm when compared w ith the conventional D IF FFT

algorithm [ I ] [10].

Hence, using the Twiddle-factor-Based FFT algorithm, i f a twiddle factor gets loaded

from the memory, it gets utilized until there is no further need for it in any further

computations. This in terms o f number o f memory access is only (N/2-1) for an A-point

input for Twiddle Factor Based FFT as compared to (N-1) fo r conventional FFT

algorithms. The power saving can be significant using the Twiddle-Factor-Based FFT

21


algorithm especially fo r large values o f N. One main advantage this algorithm has apart

from having lesser memory access is that the second stage dealing w ith does not

involve any multiplication as W °= l. Hence we also have extra power savings through

non-usage o f m ultipliers that are power hungry and computationally intensive.

2.4.1 Decimation Procedure

In case o f Twiddle-Factor-Based FFT algorithm described in [ I ] [10], the butterflies that

are computed at each stage are spread across log^ N stages o f a conventional

decimation. The Twiddle-factor-Based FFT decomposition is shown in figure 2.5. As can

be seen from figure 2.5, butterflies w ith the same twiddle factor (represented by W [x],

w ith X varying from 0 t i l l (N/2-1) and this representation is analogous toWy^), that were

computed at different stages in a conventional algorithm, now gets computed w ith in a

single stage thereby avoiding the necessity to load the same twiddle factor numerous

times as compared to just once in [ I ] [10]. The decomposition o f the algorithm proceeds

in the fo llow ing fashion. For an A-point FFT, the binary index o f a data sample resembles

(AkAk-i.... Ao), w ith k= log 2N -l.

(1) A t the first stage o f decomposition, all data samples o f the form (AkAk-i.... i j are

computed and any two data samples o f the form (AkAk-i.... 1 ) and (AkAk-i.... 1 )

can pair together to form a butterfly [ I ] .

22


gg

hlX{M)

bbiX(9)

liBlLX(l3)

STAGE 1 STAGE 2

jaaX(15)

STAGE 3

SUB-STAGEl SUB-STAGE2 SUB-STAGE3 SUB-STAGE4

STAGE 4

Figure 2.5: Flow graph o f a 16-point FFT structure based on Twiddle-Factor-Based algorithm

The twiddle factor that corresponds to this butterfly is given as where j corresponds

to the decimal value o f the binary sequence given by

(OA1C-1 ....A2A 1 1 ).

(2) A t the second stage o f decomposition, all the data samples w ith binary sequence

(AkAk-].... A 2 10) and (AkAk-i.... A 2A 1 I ) are computed. Any two data samples w ith

binary sequence (AkAk-i.... A 2 IO) and (AkAk-]....A 2 lO ), or (AkAk-i....A2A ] l ) and

{AkAk-i A 2A 1 I ) can pair together to form a butterfly. The tw iddle factor

corresponding to this butterfly is where j corresponds to the decimal value o f

the binary sequence given by (OAt-y .-Az 1 0 ).

23


(3) W ith in log2N - l stages o f decomposition, all data samples w ith twiddle factor

other than are calculated.

(4) The stage involves butterflies whose corresponding tw iddle factor value is

(000... 1).

The obvious advantage that can be seen from the above form o f FFT decomposition is the

number o f memory access that is made w ith regards to accessing the various twiddle

factors. Thus we only need (N/2-1) memory access as against (N-1) required by

conventional algorithms [1] [10]. Another interesting aspect in addition to reducing the

memory access is that the log2N‘ stage o f decomposition involves as its twiddle

factor and =1. Thus there is no m ultiplication involved w ith butterflies using this

tw iddle factor and hence the fina l stage decomposition involving (N-1) butterflies does

not involve any multiplication thereby helping us save valuable m ultiplication operations.

Thus, the Twiddle factor based FFT algorithm seems to be a more appropriate

algorithm that caters to the need for power aware FFT systems.

24


CHAPTER 3

AR C H ITEC TU R AL DESIGN AN D IM PLEM EN TATIO N

3.1 Overview

In chapter 2, we had a comprehensive explanation about the Twiddle-Factor-Based FFT

algorithm and also the computational advantages it had over other conventional FFT

algorithms was explic itly shown. Designing architecture to map this computationally

challenging but intensely less memory access-involving algorithm is the main aspect o f

our work.

Until the early to mid-90s, low power electronics were for the major part considered

only fo r a very few applications largely comprising o f small personal battery-powered

devices. But the m id 90s saw the remarkable development o f CMOS sub-micron

technology, which subsequently led to the advent o f deep sub-micron CMOS era.

Another radical change that revolutionized the electronics market was the unprecedented

demand and subsequent development o f portable communication and computational

devices that m ainly depended on battery power for their operation. Unfortunately, the

battery industry could not keep pace w ith the developments in the semiconductor

industry. As a result, high-end electronic portable devices needed constant battery

recharging, which made them less user friendly. This resulted in extensive research done

towards design and implementation o f newer V LS I algorithms, architectures and circuit

25


techniques that would utilize the minimum possible power w ithout sacrificing any o f the

other parameters such as bandwidth, clock speed, area and throughput.

Our architectural design, based on a power reduced FFT algorithm is a small step in

this direction.

3.2 Design Target

The FFT algorithms have a wide range o f signal processing and communication

transmission applications. In recent times, one important area where FFT has found

extensive application is H igh Defin ition Television (HDTV). According to the standard

o f European D igita l Broadcasting, FFT/IFFT must execute 8192 points in 896

microseconds. In addition, we also target our architecture towards Orthogonal Frequency

D ivision M ultip lexing (OFDM) transceiver whose IEEE 802.1 Ig standard requires it to

execute 7024-point FFT in 51 microseconds.

3.3 A lgorithm Setup

We recall the Twiddle Factor Based FFT decomposition structure shown in figure 2.5.

From the structure, the algorithm can be broadly classified into three distinct divisions

namely Input, Processing and Output stages.

26


x(I4ix(ia

W2)

JBuX(10)

jm iX (14 )

gg

imX(13)

dmX(ll)

jmilX(15)

STAGE 1 STAGE2 STAGES

SUB-STAGEl SUB-STAGE2 SUB-STAGE3 SUB-STAGE4

STAGE 4

Figure 3.1: F low graph o f a 16-point FFT structure based on Twiddle-Factor-Based algorithm

A t the Input stage, complex form data values corresponding to the N unprocessed, in

other words time domain signals x(n) where n=0,l,2 ,... ,(N-1) are input into the memory

device from external source (generally other blocks whose output need FFT processing).

The address location where the input data values would get stored depends on values o f

n. That is, i f data value corresponding to x(n) needs to be stored, it gets stored at a

location whose address is given by binary equivalent o f (n+ 1 ) , that is i f data

corresponding to x(0 ) needs to be stored, it gets stored in a memory location whose

address is binary equivalent o f (n+1 ) or (0+1 =1 ). S im ilarly for x(4), the address is binary

equivalent o f (4+1=5). The number o f binary bits to represent memory address depends

on the size o f the memory device as shall be seen in the coming sections. Thus, data is

fed into memory starting from x(0) all the way up to x(N -l), in subsequent locations.

27


The proeessing division is where the DFT butterfly operation takes place. This division is

equivalent to the Central Processing U nit o f any computer system. This block transforms

input time domain signals x(n) into its corresponding frequency domain signal X(k).

The transformed X(k) output values are written back to the same locations in the

memory device from where they were in itia lly accessed. Thus after all operations, the

same memory locations which contained time domain signals would now contain

frequency domain signal output. The output division accomplishes this process. For any

given N, the total number o f decomposition stages is given as log,. A , where A = / , k

being any positive integer. Hence the above stated process is repeated log,. A times

resulting in the final values o f frequency domain X(k) signals.

3.4 Random Access Memory (R AM ) Address Generation Block Design

The functionality o f the R A M address generator is to determine

(1) Total number o f stages o f decomposition

(2) Total number o f DFT butterfly operational groups w ith in each stage based on

twiddle factors

(3) Determine the memory address locations o f the data needed for each o f the DFT

butterfly operation.

The R A M address generator is one o f the few complex operational blocks designed in our

architecture. Based on our observation o f the Twiddle-Factor-Based decimation FFT

algorithm, we can make the fo llow ing conclusions assuming A=2* k is any positive

integer.

(1) Total number o f stages o f decomposition fo r an A-point input = /ogzA

28


(2) Only the butterflies in decomposition stages 1 t i l l (log2N-2 ) involve complex

multiplications.

1 3 5 (NIT, - 1 \(3) Butterflies w ith twiddle factors form the (log 2N - l ) ‘ stage

,(LOGN - 2)

and involves only multiplication w ith imaginary term j which can be effectively

implemented by switching or swapping real and imaginary terms o f the output

and inverting the sign term between them, thus avoiding use o f complex

m ultiplier.

For the purpose o f R A M address generation, we came up w ith a pseudo code in C++

which could generate addresses o f all butterfly DFTs based on Twiddle-Factor-Based

FFT algorithm decomposition described in [1] [10]. The Pseudo code consists o f two

parts w ith the first describing the decomposition for the first (ZogzA-i) stages and the

second describing the (log2N ) ^ stage. The first Pseudo code depicting address generation

o f the first (ZogzA-i) stages consists o f four nested loops.

(1) Loop 1: corresponds to the current stage number (between 1 and (logzN -l))

(2) Loop 2: corresponds to the total number o f different twiddle factors that are used

with in a given stage

(3) Loop 3: fo r a given twiddle factor, we have butterfly DFT structures that vary in size.

Loop3 computes total number o f different sizes o f butterflies that utilize a given

twiddle factor.

(4) Loop 4: Computer total number o f butterflies for a given size and a tw iddle factor

29


3.4.1 Formulae o f Computational Complexity for the Loop Structures

fo r (log2N - l) Stages o f Decomposition

Loop 1 Computation - The loop 1 as seen from section 3.4, computes the total number o f

stages o f decomposition excluding the last stage, which is generally computed separately

as this stage does not involve any tw iddle factor multiplication. Excluding the last stage

o f decomposition, the total number o f stages is (log2N - l) and hence loop l index varies

from 1 to (log2N -l) .

Loop 2 Computation - The generalized formula for the calculation o f Tw iddle factors

that are used in each o f the (log2N - l) stages, where N = / , k being any positive integer is

given as

W \ , where jc= i, 3, 5, 7... [ (N /r^ ') - l] (3.4.1.1)

where cs represents the current stage number

The current stage value (cs) varies from 1 to (log2N -l) . For our architectural design we

assumed r=2. Hence, (3.4.1.1) becomes

W \ , where x=7, 3, J, 7... /(ZW2 7 (3.4.1.2)

= , where x=7, 3, 5, 7... 7(ZW2 ")-77 (3.4.1.3)

The total number o f butterflies in each stage is given as

( 2 " -1 )A)(CS+1)

(3.4.1.4)

Equation (3.4.1.4) gives the total number o f butterfly operations per stage. Thus, for

(log2N - l) stages, the total number o f butterflies is given as

30


log2 N - \ JtJ

I (3 4.1.5)CS=1 ^

Using equation (3.4.1.3) and the values N=16 and r=2, we can work out an example to

calculate the tw iddle factors used at various stages o f decomposition using the Twiddle-

Factor-based FFT algorithm. Number o f stages w ith tw iddle factor multiplications is

given as (log2N - l) = (log2 l 6 - l ) =3. Thus the stage index value or CS varies from 1 to 3.

Stage#l:

Tw iddle factors accessed when C 5= l is given as

" , wherez=7, j , 5, 7... /(A /2")-77

= , where x = l, 3, 5 and 7

Therefore the tw iddle factors that are used when C5'=l areW/g W,g Wjg and W j

Stage#2:

Twiddle factors accessed when CS=2 is given as

, where z=7, J, 5, 7... 7(7W2")-77

= ' , where x = la n d 3

Therefore the tw iddle factors that are used when CS=2 are W l and lT,g

Twiddle factors accessed when C5=3 is given as

w;" ' , where x=7, J, 5, 7... 7(A/2")-77

= Vkif' where x = i

Therefore the tw iddle factors that are used when C 5=l is IT,g

Loop 3 Computation - For a given tw iddle factor w ith in any given stage, there can be

DFT butterfly structures o f varying sizes, where size o f a butterfly w ith reference to

31


conventional FFT algorithm can be said to be (N/2^^) where CS corresponds to current

index value o f stage number which varies from 1 to log2N fo r a given value o f A. In case

o f conventional FFT algorithms, all butterflies that were computed w ith in a single stage

were o f the same size. On the other hand, Twiddle-factor-Based FFT algorithm,

computes butterflies based on their tw iddle values. Hence, butterflies that are computed

w ith in a single stage need not necessarily contain butterflies o f the same size. The total

numbers o f different sized butterfly structures that utilize a single twiddle factor at any

given stage o f decomposition equals the value o f the stage o f decomposition where the

butterfly is currently present. For example, IT,g twiddle factor that is present in stage 1,

would have butterflies o f only one size using that twiddle factor as it lies in the first stage.

Similarly, IT,g which lies in stage 3, would have butterflies o f three different sizes. The

different sizes might also be specified in terms o f levels, w ith each level being occupied

by butterflies o f one particular size. Hence three different sizes for a single tw iddle factor

means there are three levels fo r that tw iddle factor. The size o f butterflies varies starting

from A/2 for the first level and successive steps having half the size o f the butterflies in

the previous level. For example, we saw that IT,g had three different sizes o f butterflies.

Hence,

Size o f butterflies present in level 1 = A/2

Size o f butterflies present in level 2 = V2(N/2 ) = N/4

Size o f butterflies present in level 1 = ViiNM) = N / 8

Loop 4 Computation - Having formulated the number o f stages o f decomposition, total

numbers o f twiddle factors per stage and the total number o f levels o f butterflies per

32


twiddle factor. We are le ft w ith calculating the total number o f butterflies to be computed

in each o f the levels. The formula for calculating the total number o f butterflies present

per level is given as For example, again considering tw iddle factorW,g, we can

thus calculate the number o f butterflies in each o f the three levels. Thus

Total number o f butterflies present in level 1 = = 1

Total number o f butterflies present in level 2 = = 2

Total number o f butterflies present in level 3 = 2 ' = 4

An important aspect o f the butterfly decimation is that w ith in a given level, the butterflies

can be computed in any order. S im ilarly for a given twiddle factor w ith in a stage, the

various levels can be computed in any order and so can be done w ith different twiddle

factors w ith in a given stage. Thus the parallelism can be exploited at various levels o f

decomposition. This feature makes the Fast Fourier Transform the most sought after

transformation algorithm in numerous signal processing and communication applications.

3.4.2 Pseudo Code Implementation

Based on the four loop structure calculated fo r the first (log2N - l) decomposition stages

based on Twiddle-Factor-Based FFT algorithm, a C-like pseudo code to implement the

DFT butterfly operations over (log2N - l) stages is shown below in figure (3.2).

bf_size = N/2 LOOP 1:i = 1 to (log2N -l) , i+ + a7 = (2 ^ (W ))

LOOP 2.

bf_size = N/2 a = a la = a + 2 ( j - l ) * a l

33


LOOP 3:K = (i-1) to 0, k- -

LOOP 4:m = 1 to (N/(bf_size*2)) b = a + bf_size c[m ] = aa = a + (2 *bf_size)D FT Butterfly operation to be performed

LOOP 4 ENDS a = c [ l ] / 2

bf_size = (bf_size/2 )

LOOP 3 ENDS LOOP 2 ENDS LOOP 1 ENDS

Figure 3.2: Pseudo code for the (log 2N - l) stages decomposition

3.4.3 Architectural Blocks Design and Implementation

For the pseudo code designed in figure (3.2), a logic design at the behavioral level is done

using Very large-scale integrated circuits Hardware Description Language (V H D L) tool.

Aldec Inc., license version o f Active H D L version 6.3 was used entirely to design and

simulate the various architectural blocks. The architectural implementation o f the (log2N-

1) stages o f decomposition based on Twiddle factors is shown in figure (3.3). On

comparing the pseudo code w ith the architectural blocks, we can find that the block

i j j lo c k corresponds to loop l, v^hWe j_b lock and k jb lock correspond to loop2 and loop3

respectively. Due to logical implementation complexity, loop4 is implemented using

three blocks namely m jblock, m ljb lo c k and m2_block. The main functionality o f the

R A M address generator as stated earlier is to generate address o f data elements present in

34


the memory unit that are needed for all the D FT butterfly operations over the entire

decomposition stages. For every butterfly operation, two data elements are required,

which implies that two addresses are required to be generated by the R A M address

generator fo r every butterfly operations that gets computed. From the architectural

design, qa and qb represent the two binary address values that get generated. Each o f the

two outputs is a binary sequence o f w idth determined by the size o f the memory unit it

accesses as can be seen in the upcoming sections. A ll the blocks are o f non-pipelined

nature wherein only one block remains operational at any instant. The blocks gets

initiated one after the other in the order required automatically by triggering signals

which present w ith in one block trigger the next after the completion o f execution by

current block. There are six blocks in total implementing the four nested fo r loops. The

value o f qa is generated at multiple locations along the six blocks. The correct value o f qa

that corresponds to any particular butterfly operation is chosen among the different

values based on a resolution function block. The alphabet V represents a signal that

toggles fo r every value o f loop l index and in turn triggers j jb lo c k which then gets

executed loop2 index times and V ’ represents the situation wherein jjb lo c k completes

executing loop2 index times and hence control is transferred back to ijb lo c k fo r further

processing. S im ilarly, W and W’ represent the computational triggering signal between

j_b lock and k_block. On the same ground, we can explain the functionality o f X, X ’, Y, Z

and Z ’. The complete V H D L implementation o f the architecture described in figure (3.3)

is shown in A P P E N D K . The other important signals that form a part o f our design are

the clear, enable and clock. The clock is basically a synchronizing signal whose complete

functionality is explained in section 4.7. The clear signal is used as an erasing signal

35


which when active high or 1 erases all the values at the output ports and assigns them to

high impedance.

dock

dearenable k blockenable

To e n a b le (logN ) s la g e b lo ck s W"

loop3dockdear q ak

q a

m2 block ml blockq a m R eso lu tion

block—►

Figure 3.3; B lock design representation o f (log2N - l) stages based on Twiddle factor decomposition algorithm

The enable signal is basically to determine i f a block needs to be operational at any given

instant. Any block proceeds w ith its execution only i f enable signal remains active low or

0 , fa iling which, the previous output o f that block is retained irrespective o f changes in its

input.

36


3.5 Formulae o f Computational Complexity fo r the Loop Structures

for log2]Sf Stage o f Decomposition

In section 3.4, we saw the R A M Address generation block design fo r the first (log2N -l)

stages o f decomposition based on tw iddle factors. The final stage or log2N‘ stage o f

decomposition is probably unique when compared to the other stages mainly in terms of

usage o f multipliers. The log2N ’ stage is classified in such a way that all the butterflies in

this stage irrespective o f their levels or sizes use only the tw iddle f a c t o r . The value of

is always equal to 1, irrespective o f the value o f N. DFT butterfly calculations based

on equations (2.3.6) and (2.3.7) reveal that using is equivalent to m ultiplying

equation (2.3.7) by 1. The tw iddle factors used for multiplication in the first (log2N -l)

stages are complex in nature (containing real and imaginary terms). Hence when these

complex twiddles get into equation (2.3.7), they result in complex multiplications, which

are computationally complex and intensive to design and implement. Hence the absence

o f complex multiplications in the (log2N"') stage makes this stage computationally less

intensive and results in large savings in terms o f clock cycles for computation and also in

terms o f power dissipation. The total number o f butterflies that are computed in this stage

is given as (N-1 ). Since there is no multiplication by twiddle factor involved in this stage,

we can disable all the multipliers that are designed to handle multiplications. This

procedure is more complex in other conventional FFT algorithms. Though W° factor is

present even in other FFT algorithms, they get utilized at different stages o f

decomposition, unlike in Twiddle factor based algorithm [1] wherein gets utilized

only in one stage. Hence in conventional algorithms, disabling the multipliers at different

time intervals is more d ifficu lt. This is one obvious advantage Twiddle-factor-based FFT

37


algorithm has over other FFT algorithms. As a result, we can disable the multipliers for

clock cycles equivalent to implementing (N-1 ) complex multiplications. This results in

power savings as digital circuits like memory units and multipliers are power hungry.

Though power efficient m ultipliers are being designed, it is more suitable to reduce the

usage o f multipliers.

3.5.1 Memory Reduction Technique

The Twiddle-Factor-Based FFT algorithm [1] [10], describes a methodology by which

memory access can be reduced even at the (log2N) stage. From figure 3.1, butterflies

w ith in the (log2N) stage is decomposed into (log2N) further sub-stages as depicted in

figure 3.4. From figure 3.4 it can be seen that in sub-stage 1 (S I), the two outputs o f

butterfly are represented as A and B. The output value A serves as input for C and that o f

B used as input for D present in sub-stage2 (S2), while input values for points E and F do

not depend on their previous stages. Hence input to these points has to be accessed

directly from memory units. It becomes redundant i f A and B get stored in the relatively

slow memory units and then C and D accessing those values back from the same memory

unit. Instead, A and B can be stored in temporary registers from where C and D can

access them.

38


JCC12)

XC13)

%(3)«11)«7)«13)

S U B -S T A G E SI S U B -S T A G E S2 S U B -S T A G E S S U B -S T A G E 4

S T A G E 4

Figure 3.4: (log2N f^ stage decomposition structure fo r # = 16

W hile one o f the inputs to every butterfly can thus be accessed from temporary registers

where the previous stage outputs get stored, the second input to the butterflies is only

accessed from the main memory units. As a result o f this, main memory accessing gets

reduced by as much as 50%. Accessing data from main memory units is more power

consuming than accessing them from temporary registers simply because, temporary

registers can be built according to user needs and can be placed closer to the processing

unit. Moreover the main memory structure is designed in a more complex way when

compared to the temporary register bank. The decoding for temporary register location is

also simple when compared to the main memory. Based on our evaluation, the total

number o f temporary registers required for an A-point input is N/2.

39


3.5.2 Pseudo Code Implementation

A C-like pseudo code fo r implementation o f (log2N) stage is given in figure (3.5). It is a

two for loop structure generating the addresses qa and qh that represent the binary

equivalent o f the addresses o f the two memory locations from where data fo r computing

the butterfly is accessed. This pseudo code only deals w ith address generation and not

w ith memory reduction concept.

bf_size = N/2 LOOP 1:k = 1 to (log2N), k+ + a l = 0 ; a= 0

LOOP 2:m = 1 to (N/(bf_size*2)), m ++ a = a + a l b = a + bf_size c = a; d = b; c = (a + b) d = (a — b) a l = (bf__size) * 2

LOOP 2 ENDS bf_size = (bf_size/2 )

LOOP 1 ENDS

Figure 3.5: Pseudo code for the (log2N - l) stages decomposition

3.5.3 Architectural Blocks Design and Implementation

The two-loop pseudo code is logically designed using V H D L at the behavioral level. The

blocks are designed in such a way that o f the two data values that are required for

40


computation o f every butterfly, one data comes from the temporary register while the

other comes from main memory unit. On one hand, the outputs from the butterflies that

are computed in the (log2N - l) sub-stages are stored back into temporary registers while

on the other hand, the outputs o f butterflies computed in the (logzN/^ sub-stage get stored

into the main memory unit rather than in the temporary registers mainly because, the

(log2N f^ sub-stage forms the final stage o f computation and hence it forms the output

stage and the output values need to be written back onto main memory from where they

would be accessed by other systems. Figure (3.6) shows the architectural blocks o f the

(log2Nf'^ stage implementation. As explained in section 3.4.3, the functionality o f A, A ’,

B, B ’, C, C ’ , D and D ’, is to trigger the successive blocks to which they are respectively

connected. The clear, enable and clock signals perform the same functionality as that

explained in section 3.4.3. The q_temp_reg_addr is the address corresponding to the

temporary register location from wherein data needs to be accessed. The q b jo g n

represents the second address needed fo r performing the butterfly operation, which is

fetched from R AM . The V H D L block implementation o f the architecture explained in

figure (3.6) is given in APPENDIX. On the design front, we plan on designing a 1024-

point FFT architecture targeting H D TV application. For this purpose, we need 512

temporary registers each o f 32-bits wide. The 512, 32-bit temporary registers are

implemented as a tree structure. In itia lly , one 32-bit temporary register is designed. It is

then duplicated or in V H D L terms port mapped to form the second temporary register.

41


clockc le a r

d ockc le a r

clockclear

en ab leq b jo g n 1 q je m p _ r e g _ s d q n

From (logN-1) s ta g e s block

q je m p _ re g _ a d d r

clockclear

clockclear

iogn_block5enatde en ab le

q b jo g n 2 q je m p _ re g _ a d d r2

logn_block3

lo g n _ b lo c k 4

q a j o g n

Figure 3.6: V H D L Block implementation o f (log2N f stage based on Twiddle factor decomposition algorithm

Now, the two identical registers are considered to be a single entity and this is then port-

mapped to form two more identical registers, resulting in four registers. Proceeding this

way, we get to design 512, 32-bit temporary register bank. During the different stages o f

algorithm execution, data is constantly written onto and read out from the temporary

registers. Hence, there is a necessity to generate the address o f the temporary register

where data is to be currently written or to be read. The logic block implemented as shown

in figure 3.6 generates the address corresponding to the right temporary register. This

address is decoded in stages by a decoder. Each stage o f temporary register design has its

own decoder, which decided the exact location o f the temporary register.

The number o f address bits needed to specify a location in the temporary register bank

is determined using the fo llow ing expression

Number o f address bits = log2 (total number o f register locations) (3.5.3.1)

Hence, in case o f temporary register bank, we have N/2 or 512 register locations.

Substituting in equation (3.5.3.1), the number o f address bits is 9. Thus we need 9-bit

42


binary sequence to represent each o f the temporary register locations. In the previous

section, we saw that in the (log2N f^ stage, data is accessed both from main memory unit

as well as from temporary registers. To achieve this, the design incorporates multiplexers

to choose between the two data source. Each temporary register has a Read and Write

signal. I f the Read signal is set to high or 1, data is read out o f temporary register, while

making Write signal high on the other hand, enables us to write data into the registers.

The complete block architecture for address generation fo r both (log2N - l) stages and

(log2N f ' stage is shown in figure (3.7).

LfelotSc I *mrnbW k Wof*

To (logN)Wqe Wock* " 1 ^ ' V'

Ioop1 loops

H l.,v

......1.....

r . * * » mWock

X'

qak

qam R w o W o r

qbJogN-1

clock dockcW r

k>gfi bfodf.1 A ...► kgn.M ocia

en»Meg J b g n .b b c K l

LO'

LRwoAAor

Wochfog #0*2

qb

Address to temp register

W logn

Figure 3.7: B lock implementation o f (log2N - l) stages and (log2N f^ stage based onTwiddle factor decomposition algorithm

43


3.6 Random Access M emory Design

The Random Access Memory (RAM ) is the main memory system designed and

simulated in our architecture structure. Our FFT architecture is designed for 1024 points

input system. Hence we have 7024-points time domain input signals that need

transformation. For this purpose, the input data needs to be stored in a memory unit from

where they can be accessed whenever necessary. The design and implementation o f the

Random Access Memory (R AM ) tends to serve this purpose.

The biggest advantage that FFT algorithms possesses that other transform algorithms

do not is that o f In-place computation. The FFT algorithms as seen get computed in

stages. In conventional FFT algorithms, input data as well as output data are accesses in

and out o f R AM , which serves as the primary memory unit. The input data that is used in

the computation at each stage is only needed fo r that stage. Once the output to that stage

gets generated, the inputs that this stage used are no more needed and can be replaced by

the output data obtained. This process o f replacing or re-using the same set o f memory

locations that stored the input data to store output data values is termed as In-place

computation. This concept results in m inimum requirement and usage of memory unit

size, which makes it optimal for digital systems. Hence, for an A-point FFT

implementation.

The number o f M ain memory (R AM ) locations needed = N (3.6.1)

I f each memory location is represented by the term word, then, for a 7024-point FFT

implementation, we need 1024 words or locations in the R A M to store all the input and

output values. The address locations for the memory locations start from binary

4 4


equivalent o f 0 and continue all the way up to 1023, thus specifying all the 1024

locations. Each location can store data o f width 52-bits. From equation (3.5.3.1), the

number o f bits requires to address each location can be calculated to be 10. Hence, the

address sequence starts from 0 0 0 0 0 0 0 0 0 0 all the way until 1 1 1 1 1 1 1 1 1 1 w ith increment

o f 1. Two distinct but sim ilar R A M blocks having the same set o f address sequence is

designed to store both the real and imaginary part o f the complex data input o f the form

(a+ jb) w ith a and b being any real number. Thus, accessing the exact memory locations

from both the blocks simultaneously results in accessing both the real and imaginary

parts o f an input data. The basic structure o f a 7-word R A M is shown in figure (3.8). This

structure is common fo r both the real and imaginary parts o f R A M design block.

clock

Enable from decoder

Data to be written — —

Signal to read out data

Signal to write in data

Data to be read

Figure 3.8: B lock level representation o f 1-word R A M cell fo r storing 52-bit data

From figure (3.8), the basic building block o f a R A M word consists o f a Read, Write,

Clear and enable from decoder as its signals. The Read and Write signals enable us to

input data into and output data respectively from any memory location. The Clear signal

i f high or 7 erases the contents o f the memory location and replaces the output as well as

the content o f the memory to high impedance. Only i f the enable from decoder is made

45


active low or 0, w ill it be possible to access data from or into the location. Otherwise,

whatever was the previous output continues to remain at the output port and no changes

get reflected. The structural design methodology followed for R A M is sim ilar to the way

the temporary register bank was designed in section 3.5.3. The V H D L implementation o f

the R A M cell is shown in APPENDIX.

In order to access the exact location in the R A M block, we need to decode the address

b it sequence generated by the R A M address generator. The decoder block implements

this procedure. The decoder is basically a de multiplexer, which chooses one among its

many outputs based on the value o f the input. The basic structure o f a decoder is shown

in figure (3.9).

clock

Decoder enable

2-bli address to choose between 4 outputs

RAM

decoder

output 1

o utgu t2

o u tgu t3

outgut4

Figure 3.9: B lock representation o f a R A M decoder

In our design, we implement one decoder to decode 2 address bits thus enabling us to

locate four R A M word cells. In other words, one decoder can help us operate four

different memory locations. Thus for a 7024-points R A M cell locations, we need 341

decoders in total. This can be made clear by figure (3.10) and the explanation given

below. We know that we need 7 decoder to decode 4 R A M cells and are shown below.

Let this combination be termed as a group. Hence, a group contains 4 RAM cells and 7

4 6


decoder. Based on the previous calculation, i f we have four separate groups, each having 4

RAM cells and 1 decoder as in figure (3.10a), then we got to a total o f 16 RAM cells and 4

decoders decoding these 16 RAM cells.

CELL 4CELLSCELL1 CELL 2

DECODER 1

Figure 3.10a: Basic group formation o f R A M cells using decoders

This is depicted in figure (3.10b). To choose one group among the four available, we need a

decoder. Thus, for a total o f 16 RAM cells decoding, we get to use 5 decoders. Let us term

this as high group.

ij I cai7[ j cEu.*! CEU.2 CELL3 CEU.

DECODER

Figure 3.10b: Hierarchical group formation o f R A M cells using decoders

47


Proceeding on a sim ilar ground, we can see that for four such high groups to get decoded,

we need 20 decoders, thus decoding 64 R A M cells. In order to one among the four high

groups, we need a decoder. Hence fo r decoding 64 R A M cells, we need a total o f 21

decoders. The calculation can thus be extended to higher levels o f grouping and thus for a

1024 cells R A M structure, we would need a total o f 341 decoders. The V H D L block

level design o f such a hierarchical R A M and decoder stmcture fo r 64-words R A M is

shown in APPENDIX. The disadvantage o f such huge memory units is obvious from our

previous discussions. The humongous hardware requirements hamper the performance of

such memory units in terms o f operational speed, data retrieval and storage time, and area

and power consumption. Hence in recent times, high-end research is devoted to designing

and fabricating huge memory devices w ith minimum hardware. As a result, high-speed

memories such as cache and flash memories have been designed so as to reduce the

operational delays and also power consumption. These high-end memories are designed

to be very small and hence are area efficient as well. A ll these benefits have created a

tremendous scope for such memories and are widely being utilized in many system

designs especially mobile devices where area and power savings are o f primary

importance.

3.6.1 Read Only Memory (ROM)

In the previous section, we saw the design and implementation o f Random Access

Memory (RAM ). The R A M is a memory unit wherein data can be written, read and

erased. Thus R A M can be termed as an erasable memory. On the other hand, ROM is a

Read only option memory wherein data to be stored is pre-determined during its

fabrication and is hardwired. Thus the data once hardwired cannot be changed and as a

48


result, no new data can be written on to a ROM cell. Hence there is only a Read option

and no Write option. The RO M is thus used only i f there are stored values that do not

change during the course o f execution. In any FFT algorithm, the tw iddle factors , k

=0, 1... ((N /2 )-l) do not change their values once computed. Hence for a given A-point

FFT, the entire N/2 number o f complex tw iddle factors is stored in the ROM, whose size

is determined as N/2. Thus fo r our 1024-point architecture, we need 512 locations in the

ROM to store the various tw iddle factor values w ith each location being 52-bits wide.

Since, there are 512 R O M locations; we need 10-hit binary sequence to address every

ROM location. The address to the RO M blocks are generated by ROM address

generation block, which keeps generating consecutive address locations o f twiddle

factors when ever it gets triggered by a signal from the R A M address generation block.

Every time the second fo r loop in the R A M address generation block gets executed, a

new twiddle factor needs to be accessed from ROM. Hence, a trigger signal is sent to the

ROM address generation block indicating that address for the new tw iddle factor be

generated by it and sent to the ROM blocks which w ill then output the tw iddle factor

values to the location o f the butterfly operation. Figure (3.11) below shows the ROM

blocks designed in our architecture. V H D L block representation o f RO M is shown in

APPENDIX.

49


dock p,

dear ^

ROM con toiler enable

Signal to initiate ROM from RAM address genaratof blocks

KUIV1 access address

ROMContrôler

clock ^ dear ^

|p.

512-wordsROM

Triggeringsignal to output

new twiddle factor

Twiddle factor data t>-

Figure 3.11: Block representation o f Read Only Memory (ROM) and its controller

3.7 Data-Path Design

W ith the address generation block, Random Access Memory and Read Only Memory

design complete, the architecture can now generate the addresses o f the locations from

where time domain signals stored in the Random Access Memory and twiddle factors

from the Read Only Memory can be accessed. Once the time domain signal and

corresponding twiddle factor data from the memory units get accessed, they must be

transformed into frequency domain output signal. In other words, DPT butterfly

operation needs to be performed on the accessed input data. For this purpose, we

designed the data-path block. It is in this block that arithmetic operations such as

addition, subtraction and m ultiplication are performed.

The formulation for each o f the DFT butterfly operations in the Twiddle-Factor-Based

FFT algorithm [1] is based on equations (2.3.6) and (2.3.7) and may be recalled for

clarity.

%(ik) = F # ) + F2W (3.7.1)

50


k= 0 , l,2 . . . ( N /2 - l ) (3.7.2)

Equations (3.7.1) and (3.7.2) involve one addition, one subtraction and one

multiplication. The (log 2N) stages Twiddle-Factor FFT algorithm [1] can be divided into

three distinct groups based on twiddle factors.

Group 1: consists o f butterflies from stages 1 t i l l (log2N-2 ). The tw iddle factors that are

utilized in any o f these stages are complex in nature, which is o f the form (a+ jb) w ith a, b

being any real number. The input data F j (k) and F 2 (k) stored in the memory units are

also complex in nature as the twiddle factors itself. Hence, when equations (3.7.1) and

(3.7.2) get computed, we need to perform one addition, one subtraction and one complex

multiplication which are computationally more intensive than a real multiplication.

Group2: consists o f butterflies in the (log2N -l)^ ‘ stage. The twiddle factor that gets

utilized in this stage is o f value 7= . Hence, when equations (3.7.1) and (3.7.2) get

computed, we need to perform one addition, one subtraction and one triv ia l multiplication

w ith just the imaginary term j . We term this multiplication triv ia l because m ultiplication

w ith j can be easily performed by just swapping the input real and imaginary terms and

invert the sign between the two terms. Thus, there are no m ultiplication operations

actually involved. Hence all multipliers designed can be disabled.

Group3: consists o f butterflies in the (log2N) stage. The twiddle factor that gets utilized

in this stage is whose value is always 1 irrespective o f the value o f N. Hence, when

equations (3.7.1) and (3.7.2) get computed, we need to perform just one addition and one

51


subtraction and there are no multiplications involved in this group. Hence all multipliers

designed can be disabled.

From the above three groups, we can find the addition and subtraction operations to be

common, while in case o f m ultiplication, only group 1 involves complex multiplication

and the other two groups do not involve any multiplication. Hence, exception for addition

and subtraction operations, we need three separate methods, by which we can choose

between performing complex multiplications, swapping operations and no

multiplications.

3.7.1 Complex M u ltip lie r Implementation

As seen from section 3.7, butterflies computed in group 1 have complex multiplications.

Each o f the butterfly operation thus involves one complex multiplication. Complex

multiplications are more complicated when compared to real multiplications as they

contain two terms real and imaginary. Hence complex multiplication involves more

computations than an ordinary real multiplication. There are two methods for

implementing complex multiplications in digital system. From equation (3.7.2) complex

multiplication takes place between F 2 (k) a n d W j, where k=0, 1..., ((N /2)-l). I f complex

F 2 (k) is considered to be o f the form (a+ jb) and the complex term is considered to be

o f the form (c+ jd) w ith a, b, c, d be any real numbers. The complex m ultiplication now

becomes

(3.7.1.1)

= ac+jad+jbc-bd (3.7.1.2)

52


Equation (3.7.1.2) can be implemented in two methods.

Method 1: This method is a straightforward implementation o f equation (3.7.1.2).

Equation (3.7.1.2) has two addition, one subtraction and four real m ultiplication

operations. The terms ac and bd represent the real term o f complex multiplication while

ad and be represent the imaginary part. Pictorial representation o f the multiplication

method 1 is given below in figure (3.12).

Method2: This method implements equation (3.7.1.2) using three additions, two

subtractions and three real multiplications [17]. In this method, the number o f real

multiplications gets reduced at the cost o f increase in the number o f additions and

subtractions. Pictorial representation o f the multiplication method2 is given below in

figure (3.13).

Figure 3.12; Complex multiplication implementation using method 1

O f the two above discussed methods o f complex multiplication implementation, method2

has an obvious advantage in terms o f number o f real multiplications. But the trade-off is

53


an increase in the number o f addition and subtraction operations. In general digital circuit

design terms, a m ultip lie r design and layout is on the higher side in terms o f area and

power consumption. It consumes more clock cycles for its execution when compared to

an adder or subtraction circuitry. Taking these factors into consideration, method2 of

implementing complex multiplication is better when compared to method 1. Hence our

design incorporates method2 for implementing complex multipliers.

ac-bd

Figure 3.13: Complex m ultiplication implementation using method 2

From [9], it is seen that the implementation o f a real m ultip lier involves three major steps

namely Booth encoding. Partial Product reduction and Carry propagate addition. The

purpose o f these three steps in order specified is to reduce the number o f computations on

the multiplication operations. The partial product reduction is based on Wallace tree

structure. Complete description about the three steps is mentioned in detail in [9].

Assuming each o f the three steps takes one clock cycle to execute, it takes three complete

54


clock cycles fo r implementing one real multiplication. This is depicted in figure (3.14).

The complex multiplication using methodZ can be implemented using 3 real

multiplications, 3 real additions and 2 real subtractions. The complete block level

implementation is shown in figure (3.15). The implementation is seen to consist o f 5

stages, w ith each stage consuming one clock cycle.

Booth Partial Product Carry PropagateEncoding Reduction Addition

Figure 3.14: Real multiplication implementation stages.

tmag part of FFT output

R eel part o f FFT ou tpu t

S ubi

CPA

CPA

C PA

EncodingM uit2

En coding _ M ulti

PartialP roduct

r e d u c t io n

P a r tia lProduct

reduction

PartialProduct

reduction

stage 1 Stage2 stages StageA S tag eS

Figure 3.15: Complex multiplication implementation using method 2

55


3.7.2 Group 1 Design and Implementation

In the previous section, we saw two ways by which complex multiplication could be

implemented. We also considered one o f the two methods to be more suitable fo r our

design. Having described the logic implementation methodology, we need to now see the

V H D L implementation o f the various blocks that make up data-path for group 1

butterflies described in section 3.7. Based on equations (3.7.1) and (3.7.2) we can design

our data-path. Equation (3.7.1) involves addition o f the two input data F i(k) and F 2 (k).

Since F j(k) and F 2 (k) are both complex in nature, we need to add the real and imaginary

parts o f the two data separately to satisfy equation (3.7.1). Hence we need two real adders

to implement equation (3.7.1). Equation (3.7.2) can be split into two parts w ith the first

being (F i(k}-F 2 (k)) and the second being complex m ultiplication w ith . Hence, the

second equation involves one subtraction and one complex multiplication. Due to the

complex nature o f data, we need to have two separate subtracters one each fo r real and

imaginary term. The block diagram depicting the data-path for group 1 as designed in our

architecture is shown in figure (3.16).

3.7.3 Group2 Design and Implementation

We know that the tw iddle factor that gets utilized in this group is o f value j . Based on

equations (3.7.1) and (3.7.2) we can design our data-path for group2. Equation (3.7.1) is

same as that for group 1 butterflies. Hence we can retain the same adders and subtracters

used for group 1 implementation. Equation (3.7.2) can be split into two parts w ith the first

being (F i(k)-F 2 (k)) and the second being multiplication w ith j . As discussed earlier, just

swapping the real and imaginary terms and inverting the sign between the swapped terms

can achieve m ultip ly ing w ith imaginary term j. Hence, the multipliers are disabled

56


a_reaLoutputa real

b realajmagiiia ry_outpul

a imao

EncodingMuM2

b_real_output

Twkidia tactor real

T *klcHe_faclor Imag

CPA

CPA

CPA

Adds

Sub1

Add1

Add 2

Sub3

EncodingMulti

EncodingMuH3

P a r t ia l...

ProductfeducUon

Partial Product j

reduction

PartialProduct

reduction

Figure 3.16: B lock level Data-path for group 1 fo r tw iddle factor based architecture

whenever butterflies o f this group get executed. O f the 32-bits o f data, the first b it

represents the sign bit and is 1 for negative numbers and 0 fo r positive numbers. Once

subtraction operation takes place, we use two registers to swap the real and the imaginary

terms. The two registers get activated only when group2 butterflies get computed. Thus

data-path for group2 utilizes the same adders and subtractors that were used for group 1

computation. It only requires two new registers to swap real and imaginary data. The

block diagram depicting the data-path for group2 as designed in our architecture is shown

in figure (3.17)

57


a real

boreal

a ..imaa.j 1

b jm ag

Addi

Add 2

a_r0al_ou1put

aJmaginary„oytput ►-

a realb_reaLoiJtputSubi

b real

8 im agî " g Sub2

b J m a g ! ______

Figure 3.17: B lock level Data-path fo r group2 fo r twiddle factor based architecture

3.7.4 Group3 Design and Implementation

We know that the tw iddle factor that gets utilized in this group is o f value 1. Based on

equations (3.7.1) and (3.7.2) we can design our data-path for group 3. Equation (3.7.1) is

same as that fo r group 1 and group2 butterflies. Hence we can retain the same adders and

subtracters used fo r group land 2 implementation. Equation (3.7.2) can be split into two

parts with the firs t being (F ](k)-F 2 (k)) and the second being multiplication w ith i . As

discussed earlier, m ultip ly ing w ith 1 can be just ignored. Hence equation (3.7.2) just

involves two real subtractions one each fo r real and imaginary term. Hence fo r group3

butterfly computations, we do not need to design any further blocks. We only disable the

multipliers used in group 1 and the two swapping registers used in group2. The block

58


diagram depicting the data-path for group3 as designed in our architecture is shown in

figure (3.18)

a realW Add1

b real

a_real_output

a irnaq _AddZ

b imag

a real

ajmaginary_output

b realSub1 b real output

a imag — bJmag_outpul Sub2 -------------------------------------------------------------------------------------------------------- ►

bJmag I-----------

Figure 3.18: Block level Data-path fo r group3 fo r twiddle factor based architecture

3.8 Summary

In this chapter, we summarize the various aspects o f design o f the main blocks o f the

architecture we designed based on Twiddle-Factor-Based FFT algorithm [1]. The main

blocks include the Random Access ^ e m o ry Address generator. Read Only Memory

Address generator, the 1024-words Random Access Memory (RAM ), 572-words Read

Only Memory (ROM), Address decoders. Temporary register bank and Data-path. A few

multiplexers are also designed and used in the architecture mainly used to regulate the

flow o f various signals across the blocks. Signal clarifications are done using Resolution

function blocks that are designed at necessary locations. This is done mainly because few

o f the signals have either multiple sources or destination thereby necessitating resolution

functional blocks to resolve the signal conflicts. The resolution functions subsequently

59


increase the complexity o f the architecture which can be seen from the operational speed

o f the design.

60


CHAPTER 4

ARCHITECTURAL S IM U LA T IO N RESULTS AND DISCUSSION

4.1 Output Simulation Results

From chapter 3, we can obtain the architectural design o f the various blocks that are used

in our proposed FFT processor. A ll the blocks are designed using VH D L. In this chapter,

we get to simulate the various blocks that were previously designed. By simulation, we

obtain the output waveforms fo r the blocks for different input signals under varying

control signal environments. We shall also determine numerical values o f some important

design parameters such as clock frequency, number o f arithmetic operations and hence

number o f arithmetic operators required to meet the target time constraint and the

operational speed o f the designed processor. During this discussion we shall come across

some o f the main advantages as well as drawbacks the design has and also some

important challenges that require special attention for future designers.

4.2 R A M Address Generation B lock Parameters

This is the logic V H D L block designed to generate b it sequences that represent addresses

o f memory locations in the Random Access Memory (RAM ) from where complex data

values needed to perform DFT butterfly operations are accessed. The logic for the

address generation is based on [1] [10] and the pseudo codes for both the (log2N - l) stages

and (log2N f^ stage are given in sections (3.4.2) and (3.5.2) respectively. In blocks that

61


represent the (log2N - l) stages, qa and qb represent the two output ports through which the

address values are output. Both ports output 70-bit binary sequence. For the (log2N f^

stage computation, qaddr and qb form the two ports outputting address values to the

temporary registers and R A M respectively. W hile qaddr outputs 9-bit address, qb outputs a

70-bit address sequence. The difference in the number o f bits between the two ports is

mainly because o f the difference in the size o f the two different memory units. The

enable signal is used to enable or disable the block. Disabling the block (enable signal

= ’ 1’ ) retains the previous values o f the output ports. The other signals that affect the

functioning o f this block are the clear. The clear signals represent the clear function

which when made 7, erases the values at the output ports by assigning them to high

impedance value z. Once the (log2N - l) stages are executed, the blocks corresponding to

(log2N f^ stage get activated. Each o f the blocks designed get activated one after the other

w ith one block triggering the successive block. As a result, the inherent delay w ith in the

R A M address generation block for generating successive address is 2 (20nsec) to 4 (40

nsec) clock cycles (w ith 2 being the dominant delay factor) depending on which loop the

control is currently in and in which loop the next address is to be generated. The inherent

delay is mainly attributed to the non-pipelined nature o f the design. As a result, each loop

w ith in the address generation unit gets initiated by its preceding loop, which then remains

idle t il l all the succeeding loops complete their execution. As a result o f this idle nature,

we encounter more delay than in a pipelined structure wherein every block operates at

every clock cycle.

62


4.3 ROM Address Generation B lock Parameters

The block signifies address generation fo r accessing twiddle factors needed for DFT

butterfly operations. The output port generates the 9-bit address needed to access the

twiddle factors from the various 512 RO M locations. Once the address is output, a

triggering signal initiates the RO M block to generate the data value corresponding to the

address from the ROM address generation block. The real and the imaginary data values

are output from separate RO M address blocks. The clear represents erasing signal and

has the same functionality as described in section 4.2. The ROM block is designed to

generate address without any delay and hence it can generate address every clock cycle as

per the requirement. There is no inherent delay in this block.

4.4 Random Access Memory Parameters

The R A M as described is a 52-bit, 7024-words block. The 70-bit address generated for

data access from R A M is decoded by the ram_decoder, to find the exact location from

where data is to be read out or stored. Once the address gets decoded, data from that

corresponding location gets read out or data gets stored in depending on whether read or

write operation is specified. The read and write signal get activated automatically by the

controller design, which activates the read signal as soon as address gets generated by the

R A M address generator. W hile the read signal is high or 7, write signal remains active

low or 0 and vice-versa. There is no inherent delay in this block and hence once address

is placed for decoding, data from R A M can be accessed w ith in the same clock cycle.

63


4.5 Read Only Memory B lock Parameters

The ROM, which stores the twiddle factors, is a 52-bit, 572-locations block. The input to

the ROM is a 9-bit address sequence from the RO M address generator block. Once the

address is present at the input port, the data stored in that address location is loaded on to

the output port. The output is a 52-bit tw iddle factor data. As soon as the address from the

ROM memory controller gets generated, the twiddle factor value from the corresponding

location is loaded onto the output port. There is no inherent delay in this block and hence

data can be accessed w ith in one clock cycle (lOnsec).

4.6 Data-Path B lock Parameters

The data-path serves as the arithmetic back bone o f the processor under design. From

sections 3.7.2, 3.7.3 and 3.7.4, data-path seems to vary between the three sections with

some components being common among the three. The addition and subtraction

operations in equations (3.7.1) and (3.7.2) respectively are common while the varying

component is the m ultiplication w ith the tw iddle factor. Figure (3.13) clearly specifies the

various blocks that are designed to minimize the number o f m ultiplication operations.

Assuming the arithmetic blocks getting executed one after the other, one block at a

time, it takes 5 clock cycles (50 nsec) excluding the operation o f loading the data into the

data-path from memory unit and writing them back into memory unit after data-path

operation, which takes one clock cycle each to implement any DFT butterfly arithmetic

operation in group 1.

6 4


Group 2 data path does not involve any multiplication as only swapping between real

and imaginary terms are necessary. Hence on the whole it takes 2 elock cycles (20nsec)

excluding the operation o f loading the data into the data-path from memory unit and

writing them back into memory unit after data-path operation, which takes one clock

cycle each to implement any DFT butterfly arithmetic operation in group 2.

Groups data-path is probably the simplest among the three as it only involves

multiplication w ith i , which can just be neglected. Hence other than the addition and

subtraction operations specified in equations (3.7.1) and (3.7.2), no other arithmetic

operations are required. Hence the number o f clock cycle’ s butterflies in this group take

to complete the arithmetic operations is just 1 (10 nsec) excluding the operation o f

loading the data into the data-path from memory unit and writing them back into memory

unit after data-path operation, which takes one clock cycle each to implement any DFT

butterfly arithmetic operation in group 2.

4.7 Clock Generation and Frequency Calculation

As seen in section 3.2, our architecture is designed to target two main applications

namely H D T V and OFDM transceiver. W hile H D TV FFT necessitates 8792-points in

896 microseconds, O FDM transceiver requires 7024-points execution w ith in 57

microseconds. O f the two target tim ings, OFDM transceiver’ s 57micro-seconds is o f the

shortest duration and hence we need to design our architecture to satisfy 7024-point

execution w ith in 57 micro-seconds which i f satisfied would also help achieve 8792-

points FFT execution in 896 micro-seconds. In order to achieve the tim ing target, we

65


need to first determine the clock operational frequency. The clock is basically used to

synchronize the various blocks that are designed and used in the architecture at any given

instant. For example, i f two distinct blocks need to get executed at the same instant, we

need to ensure that they start and end their executions exactly at the same instant and not

at d iffering time intervals. To ensure this, we need a signal that can synchronize all the

blocks o f the architecture. To ensure synchronization, we need to make sure that the

blocks gets enabled or disabled either at the rising or fa lling edge o f the clock pulse. The

clock frequency thus determines how frequently the clock signal rises or falls which

ultimately determines the number o f calculations that can be performed w ith in the

required time frame. On the other hand, we can also determine the clock frequency based

on the number o f operations i f known. In our design, we determine the clock frequency

based on our target tim ing o f 51 microseconds to perform 1024 points FFT operations.

We thus need to determine the number o f operations that are involved in 1024 points FFT

calculations. To determine the clock frequency, we do the fo llow ing calculations.

(1) The (log 2N f^ stage o f operation based on Twiddle Factor decomposition involves

no tw iddle factor multiplications. Assuming a pipelined structure wherein all the

blocks o f a given design have the ability to operate simultaneously and no block

needs to be idle and wait fo r data from its previous block. This enables us to

generate an output every clock cycle. This assumes one clock cycle period for

every butterfly operation through the data-path as described in section 3.7.4.

Total number o f butterflies involved in the (log2N)'^ stage = (N-1 ). (4.7.1)

Hence, total number o f clock cycles needed assuming one clock cycle for every

butterfly operation = (N-1 ) (4.7.2)

66


(2) The (logzN -lf^ stage involves butterflies w ith imaginary term ‘j ’ twiddle factor

multiplication.

Total number o f butterflies involved in the (log2N - l f ' stage = ((N /2)-l). Hence,

total number o f clock cycles needed assuming one clock cycle for every butterfly

operation = ((N/2 ) - l ) (4.7.3)

(3) The firs t (log2N-2 ) stages involve complex multiplications.

Total number o f butterflies involved in the (log2N-2 ) stages

(log, N-2)is ^ ( A / 2 ' ' ^ ) ( 2 ' - l ) (4.7.4)

i=l

Hence, total number o f clock cycles needed assuming one clock cycle fo r every

(log, N-2)butterfly operation = ^ (A /2 ^ '^ '^ ) (2 ' -1 ) (4.7.5)

f=l

Substituting N=1024 in (1), (2) and (3), the total number o f butterfly operations involved

= 5120. Hence assuming one clock cycle for every butterfly operation, the total number

o f clock cycles necessary fo r N=1024 is 5120. Since our target tim ing is 51

microseconds, to implement 5120 clock cycles w ith in 51 microseconds, each clock pulse

should be o f time width 0.02 microseconds. Hence the corresponding clock frequency

becomes i 00 Mega-hertz.

4.8 Memory Access Feature

The architecture under design is based on Twiddle-factor FFT algorithm [1] [10], which

claims to reduce the number o f memory access when compared to conventional FFT

algorithms. Since our architecture maps the tw iddle factor based FFT algorithm, it is also

67


expected to reduce the memory access when compared to other architectures targeting

conventional FFT algorithms.

W hile describing the decomposition procedure fo r Twiddle-Factor Based FFT

algorithm, we showed that once a tw iddle factor gets loaded from ROM on to the data

path, no other tw iddle factor is loaded until all the butterflies that use this particular

twiddle factor get computed. In other words, every tw iddle factor gets loaded only once

during the entire architecture implementation. The number o f twiddle factors that are

present in the decomposition o f an N-point FFT is (N/2). Hence based on the above

explanation, the total number o f times the tw iddle factors would be accessed from the

ROM where they are stored is (N/2).

In case o f conventional FFT algorithms namely D IF or D IT FFT algorithms, a

tw iddle factor gets loaded from ROM every time a butterfly gets accessed. The total

number o f butterfly DFT operations that get computed for an A-point input is (N/2)

log2N. Hence for any conventional FFT algorithms, the total number o f times the twiddle

factors would be accessed from the RO M where they are stored is (N/2) log2N.

In terms o f our architectural design, we can see that from loop 2 execution given in

section (3.4.1), the total number o f times loop 2 gets executed is given by

(4 8 1)

where loop l varies from 1 t i l l (log2N - l)

In addition to (4.8.1), the final (log2N)‘ stage, utilizes only W° factor.

Hence the total number o f RO M memory access in our architecture is given as

+ 7 (4.8.1)

where loop l varies from 1 t i l l (log2N - l)

68


Hence in terms o f ROM access, our architecture seems to comply w ith the memory

access reduction claim o f [1] [10]. Thus for various values o f N, the variation in ROM

memory access o f our architecture based on [1] [10] as compared to other architectures

based on conventional FFT algorithms like D IT and D IF are in a graphical form in figure

(4.1). In addition to the memory savings obtained through Read Only Memory (ROM),

the twiddle factor based architecture can obtain additional memory savings through lesser

Random Memory Access (R AM ) accessing.

Computation o f every butterfly DFT structure necessitates 4 R A M accesses two o f

which are fo r reading out the input data from R A M and the other two for storing back the

computed result back on to the same locations in the RAM . Hence for any given A-point

input conventional FFT algorithm and hence any architecture based on it, there are

N2A log2 A R A M accesses since there are a total o f — log^ A number o f butterfly

operations each requiring 4 R A M accesses.

Comparison of Read Only Memory (ROM) memory access between Twiddle factor based

architecture and Conventional FFT algorithm based architectures

6 0 0 0 -

5 0 0 0

4 0 0 0

3 0 0 0

2000

1000

y y / / / yNumber of FFT Input data points

- R O M a c c e s s in c o n v e n t io n a l F F T a lg o r i th m b a s e d a r c h i t e c tu r e s

- R O M a c c e s s in tw id d le f a c to r b a s e d a r c h i t e c tu r e

Figure 4.1: Graphical comparison for ROM access between twiddle factor based architecture and conventional D IT or D IF based architectures

69


In case o f twiddle factor based architecture, the final stage o f decomposition involves

butterflies utiliz ing as the only tw iddle factor. There are (N-1 ) butterflies in the final

Nstage o f which the first (— - 1) butterflies have only 1 R A M access as all other three

Naccessing is done from the temporary registers. The last (— ) butterflies on the other

hand have only 1 input accessing from R A M while both data at the output o f the butterfly

gets stored back on to the R A M instead o f the temporary registers. This is because these

set o f butterflies form the last stage o f operation and hence the final FFT data result needs

Nto be accessed from RAM . Hence these (— ) butterflies each have 3 R A M accessing.

Hence the total R A M accessing from the last stage o f decomposition is

N 3N(— -1 ) + = (2N - 1). The butterflies in the remaining stages have 4 R A M access as

usual. Hence, total number o f R A M memory access for all stages is given as

N4((— log2 N ) - ( N - 1)) + 2N - 1 = (2N log2 N - 2N + 3). Hence, when compared to the

(2 Nlog2 N) R A M access obtained from conventional D IT or D IF based architectures,

(2A log2 A - 2 A + 3) seems to be a significant reduction o f as much as 25%. This

directly has an impact on power consumption as it can be further reduced. Though

temporary register accessing is still present, it is much faster when compared to the slow

R A M accessing m ainly due to its small sizing and its positioning closer to the data-path

than the RAM . This reduction in R A M memory access is shown pictoria lly in the graph

depicted in figure (4.2).

70


Comparison of Random memory Access (RAM)access between Twiddle Factor based architecture and Conventional FFT algorithm

based architectures

„ 21000 I « 180005 8 15000 "8 « 12000 I s' 9000 E E 6000 i E 3000

n

Ær <cT rAr

Number of FFT Input data points

- RAM access in twiddle factor based architecture

- RAM access in conventional FFT algorithm based architectures

Figure 4.2: Graphical comparison for R A M access between tw iddle factor based architecture and conventional D IT or D IF based architectures

4.9 M u ltip lie r Operational Savings Using the Twiddle Factor Based Architecture

As seen from sections (3.7.2), (3.7.3) and (3.7.4), the entire decomposition operation o f

FFT based on Twiddle factors can be classified into three groups, two o f which namely

group2 and group3 do not involve any multiplication operations. As a result, the

multipliers designed and used in the data-path can be disabled while butterflies from

these two groups get computed. In conventional FFT algorithms, butterflies requiring no

twiddle factor multiplications are available in every stage o f decomposition along w ith

other butterflies that do involve multiplications. Hence disabling the multipliers at

different time instances becomes very complex and hence most architectures do not

disable the multipliers at any instant [22], [23]. As a result, irrespective o f whether a

butterfly involves m ultiplication operation or not, the m ultip lier used in the data-path

remains enabled thus consuming unnecessary clock cycles as well as power. In our

71


architecture, the butterflies involving no tw iddle factor multiplications are classified into

separate stages and hence it is easy to disable the m ultip lier across those stages.

Total number o f butterfly operations where m ultip lier remains enabled in conventional

NFFT algorithm based architectures = — log2 N (4.9.1)

Total number o f butterfly operations where m ultip lier remains enabled in Twiddle factor

based FFT architecture = ( - y lo g j N ) - [ ( - y - l ) + (N-1)]

( — log2iV - — - 2 ) 2 = 2

(4.9.2)

Equations (4.9.1) and (4.9.2) give a comparative analysis o f the savings we attain in

terms o f number o f times the multipliers get utilized effectively. A graphical

representation is shown in figure (4.3).

Comparison of Number of times multiplier gets utilized

6000<uD) 50000)o.

4000

3000

2000o«

. Q£3z

1000

Number of FFT Input data points

- Number of times multipliers get utilized in conventional FFT algorithm based architectures

-Number of times multipliers get utilized in Twiddle factor based FFT algorithm based architectures

Figure 4.3; Graphical comparison fo r m ultip lier operation usage between twiddle factor based architecture and conventional D IT or DEF based architectures

7 2


4.10 Design Issues and Challenges

The previous sections described the computational complexity, operational speed in

terms o f number o f clock cycles and also the inherent delays present in each o f the

architecture blocks. We observed that pipelining a structure could result in a shorter

execution time than a non-pipelined structure as pipelining involves every block

operating at every clock cycle (when one block computes on one set o f data, its preceding

block computes on the next set o f data while the succeeding block works on the previous

set o f data that was earlier fetched. That is i f block (x+1 ) works on data (y+J), then at the

same clock cycle, block x, which precedes block (x+J) works on data (y+2) while the

succeeding block |jc+2 j works on data y). The non-pipelined structure on the other hand,

involves enabling only one block at a time while the succeeding as well as the preceding

blocks remains idle t i l l the current block completes its execution.

In case o f our architectural design, we have managed to pipeline every block except

fo r the R A M Address generation block. The main challenge involved in pipelining this

block is the presence o f nested loop structure (a loop w ith in another loop). The presence

o f nested loops in the address generation logic design makes pipelining a non-pragmatic

and extremely challenging task. Since nested loops involve transfer o f control back from

the outermost loop to the inner most and then all the way back to the outer loop, keeping

track o f the current values o f registers and variables in each loop becomes tedious.

Consequently, we were not able to pipeline R A M Address generation block while all

other blocks could be pipelined. Non-pipelining o f this block resulted in its inherent delay

being apparent while FFT butterflies get computed. From section 4.2, we know that the

delay caused by R A M Address generation block is 2 to 4 clock cycles o f which 2 is the

73


most dominant delay factor. We also saw that all the other blocks designed and

implemented in our architecture do not involve any delay is the most dominant delay

factor. We also saw that all the other blocks designed and implemented in our

architecture do not involve any delay. Hence, we encounter 2-clock pulse delay fo r every

butterfly computed.

NTotal number o f butterflies for an W point FFT = -^ lo g ^ N (4.10.1)

Hence for N = 1024, total number o f butterflies computed = 5120 (4.10.2)

From section 4.7, the clock frequency utilized in our architecture is found to be 100 M Hz.

Hence the tim ing w idth o f each pulse obtained by taking the inverse value o f the clock

frequency = 10 nanoseconds. (4.10.3)

Hence assuming 2-clock cycle in calculating every butterfly, it takes 10240 clock cycles

to calculate all the 5120 butterflies. Since each clock pulse is o f 10 nanoseconds width, it

takes a total o f 102400 nanoseconds or 102.40 microseconds to compute 1024 input FFT

using twiddle factor-based algorithm. Since the R A M Address generation block cannot

generate address every clock cycle, the data-path that basically requires data from those

corresponding address locations to compute the butterflies w ill not be able to do so. Since

the R A M Address generation block could not be pipelined, the data-path follows suit

though it is simple to pipeline this block. Hence in addition to the 2-clock cycle we

encounter due to R A M Address generation block delay, we also have data-path delay

included.

The data-path delay as discussed earlier is determined based on whether complex

multiplication is involved or not. I f complex multiplication is involved, then based on

figure (3.16), we have 5 stages o f computation and hence the delay is 5 clock cycles. In

74


case o f no complex multiplication involvement, then the delay is only 2 clock cycles. In

addition to these delays, there are also delays due to resolution functions that add up to

additional 2 clock cycles fo r every butterfly computation. The total clock cycles required

computing a butterfly involving no complex m ultiplication is given as 6 clock cycles that

includes the R A M address generation delay, data-path delay and additional delays due to

resolution functions.

The total number o f butterflies for N = 1024 which involve complex multiplication is

1534.

Hence, the total number o f clock cycles required computing 3586 butterflies

= 1534 * 6=9204 (4.10.4)

Similarly, the total clock cycles required computing a butterfly involving complex

multiplication is given as 9 clock cycles that includes the R A M address generation delay,

data-path delay and additional delays due to resolution functions.

The total number o f butterflies for N = 1024 which involve complex multiplication is

3586.

Hence, the total number o f clock cycles required computing 3586 butterflies

= 3586*9=32274 (4.10.5)

Combining equations (4.10.4) and (4.10.5), we can determine the total number o f clock

cycles involved in computing 1024 input FFT based on tw iddle factor architecture is

4 I47&

W ith a 10 nanoseconds clock pulse width, the total time taken to execute 1024 points

input FFT based on tw iddle factor architecture is calculated to be 414.78 microseconds.

75


4.11 Architectural Salient Features and Drawbacks

The biggest advantage the tw iddle factor based FFT architecture has over other

conventional D IT or D IF based architectures is in terms o f memory access which

ultimately results in power savings.

(1) Our architectural design has 10 times lesser Read Only Memory (ROM) access

when compared to D IT or D IF based architectures.

(2) The Random Access Memory (R AM ) access is reduced by as much as 25% in our

design as compared to other previous architectures.

(3) Our design has 1.45 times lesser number o f operations taking into account the

total number o f memory accesses and multiplication computations.

(4) Clock gating where unused blocks are disabled in order to reduce unwanted

power consumption while the blocks remain idle is utilized in our design w ith the

help o f resolution blocks that enable and disable different blocks.

(5) As a result o f the reduction in the number o f operations performed, fo r a 1024

point FFT based on tw iddle factor design, power reduction up to 31.25% in terms

o f memory access and arithmetic blocks usage is expected.

The main drawback that this design suffers is in the time duration taken to compute 1024-

point operations. This is mainly attributed to the non-pipelined nature o f our design.

When compared to our in itia l target application o f OFDM transceiver which necessitates

a 7024-point FFT to get computed w ith in 51 microseconds. Hence we can observe that

our architectural design despite its advantages in terms o f computational complexity and

power savings, is 8 times more time consuming in computing 7024-point FFT as required

by the standardized O FDM transceiver.

76


CHAPTER 5

CONCLUSIONS AN D SUGGESTIONS FOR FUTURE W ORK

5.1 Summary o f W ork

Our entire work focuses on one-to-one mapping o f the Twiddle factor based FFT

algorithm described in [1] [10] on to hardware blocks so as to extract the maximum

benefits derived from the algorithm. Based on the algorithm, various logic blocks were

designed and simulated using VH D L. Memory savings in terms o f memory access have

been mapped on to the blocks from the algorithm. The architecture makes minimum

utilization o f arithmetic hardware blocks, especially the multipliers. Though this

architecture is advantageous in terms o f power savings when compared to other previous

architectures, the major drawback it faces is in terms o f the time it takes to complete its

required execution. This is attributed to the non-pipelined nature o f its design. Thus

7024-points FFT can be computed using the Twiddle factor based FFT architecture in

414.78 microseconds w ith up to 7.45 times lesser number o f operations resulting in

power savings o f up to 31.25% is expected. A ll necessary blocks have been designed and

simulated in Active-H D L 6.3.

5.2 Suggestions for the Future

As seen in Chapter 4, the main factor that delimits the efficient usage o f the twiddle

factor based architecture is the execution time. Consequently, we have explained that

77


pipelining the R A M Address generation block, which automatically enables pipelining o f

other blocks like the data-path is the most important way o f reducing the tim ing problem.

An efficient way o f pipelining nested loop structures need to be developed in order to

reduce the large execution time fo r processors.

It is a well-known fact that pipelining make architecture attain their fu ll efficiency and

help achieve a higher throughput though w ith some in itia l latency. One other important

method to reduce the delay o f operation is to utilize high-speed hardware blocks,

especially the arithmetic blocks like the adders and multipliers. The type o f complex

m ultip lie r used for our data-path is shown in figure (3.13). It can be seen that this type o f

complex m ultip lie r design uses more addition operations than real multiplications. Hence

usage o f fast adders becomes a necessity in-order to maximize efficiency and increase

throughput. From [9], it can be inferred that fo r arithmetic adders involving b it widths o f

24 or higher. Carry Look Ahead (C LA ) adders are probably the fastest when compared to

Ripple Carry Adders (RCA) or Carry Save Adders (CSA) and many others. Hence, it is

highly recommended to design C LA adders while seeking improvement in the current

design. For implementing real multipliers. Array Based multipliers that are w idely and

most commonly used over various logic designs can be utilized.

Power analysis to estimate the power efficiency o f the tw iddle factor based

architecture w ith all the above said improvements implemented should be carried out

using power estimation CAD tools. This would give an accurate picture o f the

effectiveness o f the tw iddle factor based algorithm [1] [10] over conventional D IF or D IT

algorithms.

78


Further savings in terms o f memory access, which is the main highlight o f our design,

can be effectively obtained by using smart memories like Cache and flash memories.

These memory units, unlike the Random Access Memory are small in size and are

comparatively much more efficient in terms o f accessing time. As these memories are

smaller in size and are placed closer to the processor than the Ram, which consequently

enables them to be much quicker than conventional large-scale memories. These

memories help in reducing frequent R A M or ROM access and in turn increase the

effective power saving capacity o f our design.

79


APPENDIX

V H D L DESIGN OF VARIOUS ARCH ITECTU RAL BLOCKS’

The V H D L design and implementation o f various Twiddle factor based FFT architectural

blocks described in Chapters 3 and 4 are pictoria lly represented in this Appendix.

A

K *

5 f '

# . ; e %

TTT*§•

Figure A l l : V H D L block representation o f (log2N- l ) stages R A M Address generation block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa and qb

80


(Ik elk ena" ena

U1

alier_tw o_w riteJn_tenipjegi-after_eveiY_writeJn_temp_reg^

mx_signal_oul3Jn_from_datapalhS—U2

log_N_stage_signal

N blockl_it*a(ion_of_block2

ck Wock1_sub_bk)ck1_inMw

ck no _o f_s t^s

Wue_Wn_logt_block1_out

log N stage # a l inI Isigal_alter_at_block1Jn

r

new^leg n bleckf

■W-

N Uock2_inifeitioti_of_sub_yoci(1

Wock1_in*aNon_tmWkÊomptMMn_block2 \ê_from _block2

ek

ena

^b_btack1Jnili3ting_block2

new_logn_block2

afl'îr two write in RAM

Ü7'

bk)ck1J#ak)n_in no_of_conûtafai5

eft s ^ W from blocki to blocki

cit sit_block2_itftiËion

ena

no_of_stages_in

signttf_aftef_sub_block2

-newJogn_blockFsub_blQck1

U3

N qa*lr1(5J)

after_eveiy_mte_m_temp_reg c^jogn l

0W_W_M@e_m_temp_reg < _res<Mon_1 elk signal_ffom.sii)_block2_to_sub_b!ock1

cir

ena

mx_ _ai3_n_ ram_datap8ftno_of_cwrptatk)os_in

no_of_stages>

sii)_block2^lnftiaKwi_B5

i_lrom_loyi_Uock1_in

- Ü 5 - NEW LOGN BIO CKI sA t i look:

N qa_bgi2

8ftef_lwo_wrie^ln,RAM qa<kk2ff:0)

block2_mWw_b_sub_bloek1_in qb_logi2

elk qb_fesolitfon_2

ck signd_fro(n_ai)__Wock1_to_blak2

tiMjigTÿ_oüt3_m_lrom_(btap!ih no_of_compttetion_block2_Bi

vakje_frwn_Uock2_in

value_frofn_loÿi_block1 _in

lqa_logn2

new

qbjogn

qaddr(5:0)

U6

qa(ldr(5:0)

qb_bgn

qaddfl_in(5:0)

qaddr2_kt{5:0)

qb_logn1_in

qb_logn2>

(ÿ)_re»lutoi_1_in

qb_fesolufcn_2_ii

-new-

bk)ck2_ïilfetion_fojiè_UiKR _temp_i •' ck Write_temp_fe9 '

ck

ena

mx_signal_oiA3_in_from_d@hyalh

sii)_b!ock2_iràiation_in

value.from_logn_block1Jn

READ m i E RESOLUTION FOR TEMP REG ney,

NEW_LOGN_BlOCK2_cub_blocl( i :(C)ALDEC.Inc ; 2260 Corporate Circle Henderson, NV 89074

RESOUmON_FOR_q

-#R eadJem p_reg-#W riteJem p_reg

Created;

Title:

6/16/2005

No Title

The Design venficafion Company

Figure A -12: V H D L block representation o f the (log2N)'^ stage R A M Address generation block based on Tw iddle factor decomposition algorithm generating 70-bit R A M address sequences qa_logn2 and q b jo g n and 6-bit Temporary register bank address qaddr

81


clrf?<- cIr clkB elk

readJjm agE>-,read_1Jm ag w rile J J m a g E - w r i le jjm a g U1

mad_1_imag*—I

address from deeoder#-dalajn_lmaginary{31:0)®-

write_1Jmagi

elk dia_out_1Jni8g(31:0)

dr

cs_1_rajg

dala_m_1_mag(31fl)

reBd_1_m g

<— ©BusOutputO(31:0)

ram_celljmag_1

Figure A-I3: V H D L design o f a 7-word R A M cell fo r storing imaginary part o f data.

c IrL cir elk! elk

re a d _ 1 _re a lC - re a d _ 1 je a l write 1 r e a l# - , write 1 real U1

read 1 re a l© —

address_from_decoderl data in

write 1

' elk daia_out_1(31:0)p

■ cir

' cs_1

- cWaJn_1(31;0)

•• r e a d j

•write 1

-D data_output(31:0)

ram cell real 1

Figure A-I4: V H D L design o f a 7-word R A M cell for storing real part o f data

82


cIrZclkP'

read_1_real- write 1 reaL

cirelkread_1_rea! write 1 real

de ena 2x4 real

U5^

elk de_oul_2x4_real(3:0) ■

cir

de_ena_2x4_real

de Jnp_2x4_real(1:0)

decoder 2x4 real

BUS1549(0)

BUS 1549(1)

BUS 1549(2)BUS1549(2)

de_lnp_2x4_real(1:0)

Figure A-I5: V H D L design o f a 2x4 Decoder block choosing one among four outputs based on a 2-bit input. The input port is given as de jnp_2x4_rea l(l:0 ) and output ports are represented by de_out_2x4_real(3:0)

U12 rom__designJmag_new

rom_imga_out_real(31:0)

U9

— #^BusOutput1 (31:0)U10

mem_oonli^0^_controlier1 — >

rom_design real new

elk mem_contr_out(5:0)

cir rom jnitiatlon_signal

mem_contr_ena

rom_sig_in

elk

mem_eontr_ena

rom_address(5;0)

rom initiation

signal_from_block3Jnput

rom_real_out(3:0)

elk

mem contr ena

rom_address(5:0)

rom initiation

signal_from_bloek3Jnput

romJmag_out(3:0)

Figure A-I6: V H D L design o f a 64-words ROM and a decoder along w ith interconnects outputting 32-bits real and imaginary tw iddle factors

83


re a d J J m a g ^ V ea d _1 _ im W e _ 1 _ im a g # f W te_1Jnr

BUS11

read_1_itnaglwrite_1Jmagl

d a la _ in jjm a g (3 1 :0 )# —

de_inp_2x4Jmag_2(1:0)l

de_ena_2x4_imag

de_inp_2x4

le_inp_2x4Jmag_1 (1:0)© —

U4_ _ _ _ _ _ _ _ _ _

elk de_oul_2x4_iiiiig(3:0)

dr

de_ena_2x4_lmag

de_inp_2x4_mag(1:0)

-4«US1t7Ç|

IIUS117(2)

BJS117(3)

decoder_2x4Jmag

cfc dahi_CKjl_1_imag(31:0)

cir

data_in_1_ima01:O}

de_ena_2K4_imag

de_inp_2%4_im@g(1i])

de_inp_2x4îmag_1(1fl)

r«8d_1_imag

w te 1 imag

\16 words RAMJmag2\ — #data_oul_1_imag(31:0)

cl data_oj!_1_imag(31;0) ck

de_ena_2K4_fnagds_inp_2x4_ima9(1Æ)

de_inp_2)i4_iïiag_1(1 l)

read_1Jmag

wie.1_»nag

\16 words RAM imag2\ U3

e l d*i_oüt^1_imag(31;0)

ctfdata_in_1_imag(31:0)

de_ena_2%4_mg

de_mp_2x4_mag(1Æ)

de_inp_2»4_BTiag_l(1:0)

read_1Jmag

\16 words RAM imag2\ U5

elk dat!_d td j_iirag(31:0)^

de_@na_2x4_imag

(k_inp_2x4_im@g(1:0)

de_Hip_2x4_imag_1(1fl)

rflad_1Jmag

In}flte Circle ,NV 89074 The Design Verification Company

11/4/2004

No Title

\16 words RAMJmag2\

Figure A-I7: V H D L design o f 64-words R A M using four 76-words R A M cells and a decoder along w ith interconnects outputting 32-bits real and imaginary data

8 4


a_realjnp(31:0)

b_realjnp(31:0)

adtH ena

jmagLinp(31:0)

fiag_inp(31;0)

add2 ena

sub1 ena

multi ena

B—I sub2 ena^

mull2_ena

B—adds ena

c_reai(31:0)® -

Isub3 ena ^

U1

a_addripi:0) add1_oi^31:0)

a d d ijn a nK_dM3path_1

b jd d r lp fO l

elk

elf

U2 adder1_ne,v_

a_aM2(3l:0) ailil2.i»jl(3l:0)

atkj2_ena

aiM2(31:[lt

elk

cir

U3adder2 new

i_subr1(31:0) sub1_Qut(31:0|

iju b rip i.O )

elk

elf

subi eng

U4Subtfactor1_new

i_subi2pi:0) siA2_out(31:0)

)jmm)elk

cir

sub2 ena

s # a c to f2 new

multS ena sub4 ena add4 ena adds aiaa_real_o(Jt(31:0)

multiplier1_nw

a_!Tuft1pl;0) multl_out(31:Q)

eï(

cir

multi ena

a_mtA2p1:Q) rnjlt2_out(3!:0)

_n#2pi:0) elk

cir

mult2 ena

buf3_sub4pi.O) sub4_out(31;0)

bul4_stA4pl:0)

elk

cir

sub4 ena

subtractff'l m -

U7 r2 new

a_ad(Jf3pi:0) add3_out|31;0)

add3_ena

)_acfdr3pi:D)

elk

cir

U8 adder3_ne¥.'

:_subi3pi;0) sub3_out(31:0]

c_sutff3pl:D)

elk

cir

subS ena

Jtle:

a_mult3(3!:0) nul3_oiiÿ31:D)

)_mult3pl:0)

elk

elf

multS ena

U11ajmaa_out(31:0)

ad(M_ena add4_(M(3l:0)

buf3_add4pi:0)

buf4_ald4pl:0)

adder4 nen

bjmag_outjroup1(31:0)

bjmag_oiyroup2(31:0)

b real4315-

addSjna add5_ou^l:0)

buf5_add5pi:0| rn x ja la p a th j

elk

cir

stA4_add5(3!;0)

adderS_newJb_real_out_group1(31%

b I

bjinag_oiyroup3(31:0)

(()ALDEC.Inc 2 !60 Corporate Circle H ïnderson, NV 89074 | Qgjg \j f Company

eated: 11 /1 /2 0 0 4

No Title

clki dk

diL di

LDECÜ14-

dk m x jig n a lju t i

cir m<_S!gnal_out2

iTO_datapatli_IJn m<_s!gnal_otiÛ

rn(_dalap#_2jn

inx_si(

—Kmx sii

Cl i h t r o r t r r l nm.'inx_datapatli_resolution_nev/

Figure A-I8: V H D L design block showing the data-paths for group 1, group2 and groupS. The output signals fo r each group are represented by b_real_out_groupX and b_imaginary_out_groupX w ith X representing individual group numbers.

8 5


add2_enalaiM1_ena@U2

cadeim_enaB— ^

fi(Jnp1(31:0)B------ -ix_inp2(31:0)&------ *-

tlk d»m_ouH(31jl|

dr dmx.oul2(31C)

dsrrax.cs deim_oii3(31:0)

deniitr.ena dm .oU4(31:0)

der«iJX_M(3W) demux_inp2(31J)

demultiplexerja_ql)

add3_enal---------------I enal

cjmag(31;0)^c_real(31:0)i>-

addS eti!#-

mull1_enal muKend mull3_enal sub1_enal sub2_enal sub3_enai aib4 end

U1

Ia.imag.it()(3lil) a_irrtag_rrirl|3M)

a je d jp p U ) a jea l.ori(3 IJ)

addt_ena b,inag.mrl(31:0)

add2_erra b.real.ota(3l:0)

add3_sna mxjgtral_oiiM

addd.eria mx_signal_(jrjl2

add5_ena n t* j^ _ o rd 3

b_irra9.B|)(31:0)

b_real_irrp(3W)

tJmag(31J)

c_raal|3t:ll|

dk

dr

mriltl_arta

i!Ë2_«ria

mi#3_etra

snbt_e«a

sub2.ena

5tÈ3_m

snb4_atta

U3

ck dala_miix_(ijt1(31d))

dr data.inrK.oifi(31J)

(telrr_mi»_ts

data_mn(_ens

date_mtK>1(3li))

dala_mrrx_inp2(3lll)

dah_miri(_inp3(3H))

dala_intix_in|)4(3l:0)

rmdtiplexefjWapat

■#dala_niux_out1(31:0)■#data_mux_ou12(31:0)

: ena

lmx_signal_out1

-#mx_sigral_oul3

Oal bcompWe

Figure A-I9: V H D L design block showing the resolution function designed fo r the datapath to choose between the three data-path groups and also for determining the Read and Write operations o f the R A M block and other resolution blocks.

86


d#Ctlr

ig afe datajntojam US

N a_tiiljM_tes#ig_m_m1

c l b_outJoi_resoMng_in_m1

ck qa(5:0)

era qb(5:0j

sigjfler.tla laJnto jan i tom.sig

signal_froniJock3J

signaljmm_blocl(3J

mm_sig#-'

qb(5dl)@-ijknim1ni2_resoiiition_qa_int1ostdlogjcJognstage

# -

a_MiljM_reséirig_m_m1jesttïiiiu*_6ra

c l cs_irax_erajeed

clics_nnix_enajee(tiack

ena

m_datapalk_(mt

lead jS erJom jam

vnJe alec tom lam

Cl imix enaU3

prog_contr_resolytion_niüx_os_ena

U4

" a_oUJor_tesolviiig_m_m1 je c é e muxjna

■■ b_oiit_foijesolmg_in_tii1_t6solï6

- c l

-cil

a_mdJotjesotiing_m_mteaàeWaeJe_iam

e- c l wle_ïalue_lo_ram

■*-cli

*-ena

mx.datapait.ciil

+—#fead_valueja_ t—*wrile vAe lo

ram read writd

-— Ireadjfterjromjam

"write after from ram

ena

Figure A-IIO : V H D L design block showing R A M Address generation block with resolution functions that aid in enabling and disabling various signals at different instances o f time

87


d#tdr

U1

m ux_clr#f)m ux_clrIt. mem contr-ena-

mem contr ena '

n1m2 eso lu liqnja jiiltostd logicjop i li c

N 8_oul_foi_resokwg_m_m1

ck b_oifl_foM6s<A'mg_m_ml

cir qa(5:0)

ena # : 0 )

yg_after_data_into_ram rom_sig

siÿwl_fiom_block3_1

m-ck rrw(_out(5;0)

iwx_clr

mux_cs_fromj€soMion

mux_8na_ftomjesolutiofi

m ux>p1(5;0j

0:1

m u ltip le x erja jb

datajn_1(31:0)

datajn_1_imag(31:0)

a_outJofjesofkm a.m _m 1jesotem ® _ena

ck cs_mux_ena_f@ed

CËes_mux_ena_fe«fcack

m>;_datâl}i_out

read_aftef^from_ram

write after ftom ram

prog_contrjesolution_mux_cs_ena

,A(::3),A(4;5

tw-a_wl_for_resolmg_m_m1_reaoke m u x jn a

b_oU_fot_ies(^ng_m_ni1 resotve

c_out_after_two_reads_iam

elk

cir

mx_datapath_out

■ prog_contH@ekrti6it mux enable

tm-

signal_tojnpdl_datajnto_ram_initi%

c_CHjt_f6rjesovlrîg

read_after_for_tesdving

dala_in_1{31;0) sig_aftei_dala_from_fan

dala_in_1_imagf31:0>wie_aft6r_forj6sotving

rmrx_orftput{5:0}

read_1

signaljo_lnput_data_hto_ramjratialf/

mem contr e n a # —

U9

(C)AÎÎB1

U3

a_crtrt_for_resolvittg_m_m!ea*9Wae_toj»ii

ck writejà ie J o _ r m i

ck

ena

mx_datapalh_out

prog_contrjesclution_ramjead_ttrite

^mux_datapath_out

-W-

c k data_outj(31:0;

ckd a ta jn j(3 l:0 )

de_ena_2x4_real

de_bp_2x4.real(1;0)

de_inp_2x4_ieai_1{1:0)

de_kip_2x4_reaL2(1:0)

Ï64 WORDS RAM2i

I.DECCorp

ttendersOT

-W-s$'

Af.j).

’■ Î

c k data_out_1_imBg(31Æ)

ck

data_ii_1Jmag(31:0}

de_ena_2x4_imag

dejnp_2x4_ima9{1:0)

de_lnp_2x4_ima9_1(1:0)

de_B^)_2x4_imag_2i1f))

read_1_imag

wrile_1_imag

data_out_1_lma( (31:0)

de ena 2)

(31:0)

4_imagWORDS RAMJmag2\

de ena 2x4 i

U12

Inc jra te Oifcl , , N V 8 #

elk m6m_conlr_outi5:0}

cli iom_inftiation_signal

mem contr ena

.rom_sig_in

20

ck rom_imag_out{3:0)

mem_contr_Kia

iom_address(5Æ)

rom_kiSinsi?id_from _block3_ii^

U10

ificatio 1 C

i t i e

rom_imag_out(3:0) ' i)ul(3:0)

"—Belk rom_real_out(3:0)

mem_conti_ena

rom_adèessi;5Æ)

romjnitiatton

signalJom_bkick3_inpift

rem-desigiyeal_new

rom_design_imag_new

inemory_conlroller1

Figure A - I l l : V H D L design block showing R A M Address generation block, the R A M blocks, the ROM memory eontroller block, the ROM blocks and all the other interconnects


A *C«naU1

nnK_dr#t,fnuL*It 'ff ie m corjir ena

mem contr ena

IJkmm 1m2_resolLttioi Lqajnttostdlogicjognstage

ü ttr

3dd1_ena data_mux_Mi1(31:0)

add2_ena data_mii<_(wt2(31:0)

add3_ena TO_signaI_ou{l

adcM_ena fre_signaUut3

addSjna

c.imagplOJ

c_iB3lp1:0)

elk

cir

dat3_rrux_«ra

demi(_cs

derrwx_ena

dema_inplp1:0)

detnix_inp2p1:0)

mult!_ena

rrult2_ena

multSjna

siÈI_ena

si4i2_ena

si*3_ena

s # ena

(tatapathjejo lutionjonip let

N a_out_for_tesoMn9_m_m1

elk b_out_for_iesolving_m_iTi5

cir qa(5;0)

qb(5:D)

fter_data_lnto_i3m rom_sg

signat_Wm_bbck3_!

dk )TiKjut(5:0)

mux_cs_fit)m_resolution

t7ux_enaJrom_resoton

mux_mpl(5:0)

mK_ïç2(5:0) A 5:1

multiplexer_qa_qt:'

mx_signal_out3

Wr3_oiflJor_resolving_m_ml_reso}y8:s_mux_ena

elk cs_mLK_8na_feed

cir

es mux ena feedback

mx_datâtti_out

read_êr_from_ram

wnte_êr_ffom_ram

rtcfcowtrj olutionjririi'te ^

,A(2:3).A(4:5)

-W-8_ait_for_RSolmg_m_ml_resQlve m jx jn a

b_oul.for_(ESokmg_m_ml_fesrtve

c_out_after_twD_î5ads_ram

elk

cir

mt_d3fapath_out

-profj ot-resolulionj

« -

C—

signalJoJnpptJataJntojamjnitially

c_oul_fcr_resoving

read_after_fof_resolving

i_in_ipi:0) sig_afler_dafajrüfîi_r3m

i_in_f Jmag(3l;0) wte_afterJof_iesoMng

nux_oi5iuq5:0) read, I

s^nalJoJnput_d3ta_into_r3m_initially

lUX

)EC. Inc brporate dirclersqnpNVetW

d: 5/17/21

|U3â_oulJof_resoMng_m_mt_ie8bJyslue_to_r3rn

ck wite_value_îo_ram

cir

ena

nt<_datâtb_out

progL,contjesolution_ram_rpad_wri' a

4 3 6 -

A((

elk data_oia_ipi:0)

cir

data_in_1(31:0)

de_ena_2x4_real

de_irç_2x4_real(l;0)

de_inp_2x4_real_l(!:0)

de_inp_2x4_ieal_2(l:0)

,5

Mite_l

\6f| WORDS RAMSU

W-

Üiili i i i lA O I)

elk d3ta_out_tj

cir

data_in_1_imaB(31:

de_en3_2x4jrT0g

np_2x4_knag(1

np_2x4_imag_'

dejnp_2x4_irnag_;

lead.l.lmag

w!te_l_m g

\64 WORDS R

U12

l ie Design Verjl

mem contr i

elk mem_ccmtr.ouq5:G)

cir romJnidation_signal

mem_contr_ena

rom_sig_in

elk ramjmag_ou^3;0)

mem_corW_ena

fDm_3ddress(5:0)

rom^nitiafew

;ignal_fmm_block3_iivutU10

rom jesigi

memory_controller1

lac(Dm_real_out(30)

mem contr enanw ■iôm_address(5:0)

tcm_inibaüon

signal_from_bk)ck3_inpul

_ jd f lc in n r a a l n a u i

Figure A-I12: V H D L design block partia lly showing interconnection between all the blocks o f the Twiddle factor based FFT architecture.

89


BIBLIO G RAPHY

[1] Yingtao Jiang, T ing Zhou, Yiyan Tang and Yuke Wang , “ Twiddle-Factor-Based FFT

A lgorithm w ith Reduced Memory Access” , In the Proceedings o f the International

Paralle l and D istributed Processing Symposium, IEEE, 2002.

[2] Bevan M.Baas, “ An Approaeh to Low-Power, High-Performance, Fast Fourier

Transform Processor Design” , Ph.D. dissertation, Stanford University, Stanford, CA,

February 1999.

[3] Smith, Steven W ., “ The Scientist and Engineer's Guide to D ig ita l Signal Processing” ,

2nd edition. San Diego; California Technieal Publishing, 1999. ISBN 0-9660176-3-3.

[4] Alan V.Oppenheim and Ronald W.Ssehafer, “ D igita l Signal Proeessing” , Oetober

2000. ISBN-81-203-0532-9

[5] Brown, J. W. and Churchill, R. V., “ Fourier Series and Boundary Value Problems” ,

5th ed. New York: M eG raw -H ill, 1993.

[6] Brigham E.G., “ The Fast Fourier Transform and Its Applications” , Prentice-Hall,

Englewood C liffs, NJ, 1988.

[7] Julius O.Smith III, “ Mathematics o f the Discrete Fourier Transform (DFT), w ith

Music and Audio Applications” , W 3K Publishing, 2003. ISBN 0-9745607-0-7.

[8] John Proakis, D im itris Manolakis, “ D ig ita l Signal Processing - Prineiples, Algorithms

and Applications” , Pearson, ISBN 0133942899.

90


[9] Jan M.Rabaey, Anantha Chandrakasan and Borivoje N iko lic, “ D igita l Integrated

C ircuits-A Design Perspective” , Prentice Hall, Second Edition, 2003. ISBN 0-13-

090996-3.

[10] Yiyan Tang, Yingtao Jiang and Yuke Wang, “ Reduce FFT Memory reference for

low power applications” . In IEEE International Conference on Acoustics, Speech and

Signal Processing, 2002, Volume 3, 13-17 M ay 2002 Pages(s).TII-3204- III-3207 vol.3.

[11] Byerly, W. E., “ An Elementary Treatise on Fourier's Series, and Spherical,

Cylindrical, and Ellipsoidal Harmonics, w ith Applications to Problems in Mathematical

Physics” . New York: Dover, 1959.

[12] Carslaw, H. S., “ Introduction to the Theory o f Fourier's Series and Integrals” , 3rd

ed., rev. and enl. New York: Dover, 1950,

[13] Davis, H. F., “ Fourier Series and Orthogonal Functions” , New York: Dover, 1963.

[14] Dym, H. and McKean, H. P. “ Fourier Series and Integrals” , New York: Academic

Press, 1972.

[15] Folland, G. B., “ Fourier Analysis and Its Applications” , Pacific Grove, CA:

Brooks/Cole, 1992.

[16] Groemer, H., “ Geometric Applications o f Fourier Series and Spherical Harmonics” ,

New York: Cambridge University Press, 1996.

[17] Oppenheim A.V., Schafer R.W., and Buck J R., “ Discrete-Time Signal Processing” ,

Prentice-Hall, 1999.

[18] Smith S.W., “ The Scientist and Engineer's Guide to D ig ita l Signal

Processing” , (http://www.dspguide.com/pdfbook.htm), California Technical Publishing,

San Diego, 2nd edition, 1999.

91


http://www.dspguide.com/pdfbook.htm

[19] Retrieved from “ http://en.wikipedia.org/wiki/Discrete_Fourier_transform” .

[20] Sevan M . Baas, "A Low-Power, High-Performance, 1024-point FFT Processor."

IEEE Journal o f Solid-State Circuits (JSSC), pp. 380-387, March 1999.

[21] Sevan M . Baas, "A 9.5mW, 330/tsec, 1024-point FFT Processor," Proceedings o f

the 1998 Custom Integrated Circuits Conference (CICC), Santa Clara, CA, USA, 11-14

M ay 1998.

[22] Wen-Chang Yeh, and Chein-Wei Jen, “ High-Speed and Low-Power Split-Radix

FFT” . In IEEE International Conference on Signal Processing, Vol., 51, No.3, March

2003.

[23] He .S. and Torkelson, “ Designing pipeline FFT processor fo r OFDM

(de)Modulation,” in Proc. IEEE URSI International Symposium o f Signals, System and

Electron, 1998, pp.257-262.

92


http://en.wikipedia.org/wiki/Discrete_Fourier_transform%e2%80%9d

VITA

Graduate College University o f Nevada, Las Vegas

Bhaarath Kumar

Home Address:1600, E. Rochelle Ave., Apt 42 Las Vegas, NV-89119.

Degrees:Bachelor o f Engineering, Electronics and Communication Engineering, 2002, University o f Madras, India

Thesis Title:Design and Implementation o f a Fast Fourier Transform Architecture using Twiddle Factor Based Decomposition A lgorithm .

Thesis Examination Committee:Chairperson, Dr. Yingtao Jiang, Ph. D.Committee Member, Dr. Emma Regentova, Ph. D.Committee Member, Dr. Eugene McGaugh, Ph. D.Graduate College Representative, Dr. A jit K.Roy, Ph. D.

93


Documents

Design and implementation of a fast Fourier transform