s.Magesh kumar DECE,BTECH,ME (ASAN MEMORIAL COLLEGE OF ENGINEERING AND TECHNOLOGY)

A Combined SDC-SDF Architecture For Normal I/O Pipelined Radix-4 FFT

S.MAGESHKUMAR,

Department of ECE,

Asan Memorial College Of Engineering And Technology,

[email protected].

Abstract

We present an efficient combined single-path delay commutator and single-path delay feedback (SDC-SDF) radix-4 pipelined fast Fourier transform (FFT) architecture, which includes the SDC stages and one SDF stage. The SDC processing engine is proposed to achieve 100% hardware resource utilization by sharing the common arithmetic resources, both adders and multipliers, in a time-multiplexed fashion, so that the arithmetic resource is reduced compared with the other radix-4 SDC-SDF architectures. In addition, the proposed architecture requires a roughly minimum number of complex adders and a complex delay memory of 4N+3.0.

Index Terms - Fast Fourier transform (FFT), pipelined architecture, single-path delay commutator processing engine (SDC PE).

Introduction

The fast Fourier transform (FFT) has played a significant role in the digital signal processing field, especially in advanced communication systems such as orthogonal frequency division multiplexing (OFDM) and asymmetric digital subscriber line (ADSL). All these systems require FFT computation with high throughput and low latency. Designing a high-performance FFT circuit is therefore an efficient way to meet these requirements. In particular, pipelined FFT architectures have mainly been adopted because of their attractive properties, such as small chip area, high throughput, and low power consumption.

To the best of our knowledge, two types of pipelined FFT architecture can be found in the literature: delay feedback (DF) and delay commutator (DC). Further, according to the number of input data stream paths, they can be classified into multi-path (M) and single-path (S) architectures. The two classifications form four kinds of pipelined FFT architecture: MDF, MDC, SDF, and SDC. The multi-path architectures are often adopted when the throughput requirement is beyond the theoretical limit that a single-path architecture can offer at a given clock frequency. However, they require concurrent read (write) operations for the multi-path input (output) data. Therefore, a single-path (S) architecture could be appropriate in cases where the system cannot ensure concurrent operations. However, its arithmetic utilization is relatively low compared with the 100% utilization of the existing MDF/MDC architectures, some of which also achieve 100% multiplier utilization by reordering the inner data sequence.

For a single input data stream, the conventional radix-4 SDF FFT architecture requires a number of complex adders and complex multipliers that depends on N, where N is the FFT size. Both Chang [11] and Liu et al. [12] present novel SDC architectures that reduce the complex adders by 50% by reordering the inner data sequence. However, the utilization of the corresponding complex multipliers still remains 50% for both architectures. We therefore study whether the complex multiplier unit can be modified to achieve 100% utilization.

In the radix-4 FFT architecture, a common observation is that one half of the data (the sum part of the butterfly operation) does not involve a complex multiplication by the twiddle factor W at all, while the other half (the difference part) does involve a complex multiplication by W. Hence there is an opportunity to reduce the arithmetic resource of the conventional complex multiplier by a factor of 2, leading to 100% utilization. Ideally, every two consecutive complex input data should contain one complex number that needs a complex multiplication. If so, we can minimize the reordering memory requirement while reducing the arithmetic resource of the complex multipliers by 50%.

Fortunately, the improved SDC architecture can produce the sum and the corresponding difference results of a butterfly operation in two consecutive cycles. The sum part is passed directly to the next stage, while the difference part needs a complex multiplication before being passed to the next stage. Therefore, the SDC architecture is ideal for our efficient pipelined radix-4 FFT architecture. The SDF architecture, however, does not meet the above constraint well, since the sums of all butterflies in a stage are produced first, followed by the corresponding differences.

In this brief, we present an efficient combined SDC-SDF radix-4 pipelined architecture, which includes the SDC stages, one SDF stage, and one bit reverser. The SDC processing engine (SDC PE) in each SDC stage achieves 100% hardware utilization of both adders and multipliers. We include the SDF stage to reorder the data sequence, so that the delay memory of the bit reverser is reduced to N/4. The proposed architecture produces its output in the same normal order as its input.

REVIEW OF PIPELINED FFT ARCHITECTURE

A. FFT review of radix-2 :

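Let us consider the computation of the $N = 2^v$ point DFT by the divide-and-conquer approach. We split the N-point data sequence $x(n)$ into two N/2-point data sequences $f_1(n)$ and $f_2(n)$, corresponding to the even-numbered and odd-numbered samples of $x(n)$, respectively, that is,

$$f_1(n) = x(2n), \qquad f_2(n) = x(2n+1), \qquad n = 0, 1, \ldots, \frac{N}{2}-1.$$

Thus $f_1(n)$ and $f_2(n)$ are obtained by decimating $x(n)$ by a factor of 2, and hence the resulting FFT algorithm is called a decimation-in-time algorithm.

Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as follows. The N-point DFT is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1,$$

where $x(n)$ is the input data, $W_N^{kn}$ is the coefficient ($W_N^{kn} = e^{-j 2\pi nk/N}$), and N is any integer power of two. Separating the even- and odd-indexed terms gives

$$X(k) = \sum_{n\ \mathrm{even}} x(n)\, W_N^{kn} + \sum_{n\ \mathrm{odd}} x(n)\, W_N^{kn} = \sum_{m=0}^{N/2-1} x(2m)\, W_N^{2mk} + \sum_{m=0}^{N/2-1} x(2m+1)\, W_N^{k(2m+1)}.$$

But $W_N^2 = W_{N/2}$. With this substitution, the equation can be expressed as

$$X(k) = \sum_{m=0}^{N/2-1} f_1(m)\, W_{N/2}^{km} + W_N^k \sum_{m=0}^{N/2-1} f_2(m)\, W_{N/2}^{km} = F_1(k) + W_N^k F_2(k), \qquad k = 0, 1, \ldots, N-1,$$

where $F_1(k)$ and $F_2(k)$ are the N/2-point DFTs of the sequences $f_1(m)$ and $f_2(m)$, respectively.

Since $F_1(k)$ and $F_2(k)$ are periodic with period N/2, we have $F_1(k+N/2) = F_1(k)$ and $F_2(k+N/2) = F_2(k)$. In addition, the factor $W_N^{k+N/2} = -W_N^k$. Hence the equations may be expressed as

$$X(k) = F_1(k) + W_N^k F_2(k), \qquad k = 0, 1, \ldots, \frac{N}{2}-1,$$

$$X\!\left(k+\frac{N}{2}\right) = F_1(k) - W_N^k F_2(k), \qquad k = 0, 1, \ldots, \frac{N}{2}-1.$$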
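
The recursion $X(k) = F_1(k) + W_N^k F_2(k)$, $X(k+N/2) = F_1(k) - W_N^k F_2(k)$ can be checked numerically. The following is a minimal software sketch in Python, not a model of the proposed hardware; the function names `dft_reference` and `fft_dit2` are illustrative assumptions.

```python
import cmath


def dft_reference(x):
    """Direct O(N^2) evaluation of the DFT definition, used only as a check."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft_dit2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 2:
        raise ValueError("length must be a power of two")
    f1 = fft_dit2(x[0::2])  # F1(k): DFT of the even-indexed samples
    f2 = fft_dit2(x[1::2])  # F2(k): DFT of the odd-indexed samples
    out = [0j] * n_pts
    for k in range(n_pts // 2):
        t = cmath.exp(-2j * cmath.pi * k / n_pts) * f2[k]  # W_N^k * F2(k)
        out[k] = f1[k] + t                # X(k)
        out[k + n_pts // 2] = f1[k] - t   # X(k + N/2)
    return out
```

Comparing `fft_dit2(x)` with `dft_reference(x)` for a short power-of-two input is a quick way to confirm that the sign of the twiddle term is handled correctly in both halves.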

We observe that the direct computation of $F_1(k)$ requires $(N/2)^2$ complex multiplications. The same applies to the computation of $F_2(k)$. Furthermore, there are $N/2$ additional complex multiplications required to compute $W_N^k F_2(k)$. Hence the computation of $X(k)$ requires $2(N/2)^2 + N/2 = N^2/2 + N/2$ complex multiplications. This first step reduces the number of multiplications from $N^2$ to $N^2/2 + N/2$, which is about a factor of 2 for large N.

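By computing N/4-point DFTs, we obtain the N/2-point DFTs $F_1(k)$ and $F_2(k)$ of the sequences $f_1(n) = x(2n)$ and $f_2(n) = x(2n+1)$ from the relations

$$F_1(k) = \mathcal{F}\{f_1(2n)\} + W_{N/2}^k\, \mathcal{F}\{f_1(2n+1)\},$$

$$F_1\!\left(k+\frac{N}{4}\right) = \mathcal{F}\{f_1(2n)\} - W_{N/2}^k\, \mathcal{F}\{f_1(2n+1)\},$$

$$F_2(k) = \mathcal{F}\{f_2(2n)\} + W_{N/2}^k\, \mathcal{F}\{f_2(2n+1)\},$$

$$F_2\!\left(k+\frac{N}{4}\right) = \mathcal{F}\{f_2(2n)\} - W_{N/2}^k\, \mathcal{F}\{f_2(2n+1)\},$$

for $k = 0, 1, \ldots, N/4-1$, where $\mathcal{F}\{\cdot\}$ denotes the N/4-point DFT, evaluated at index $k$, of the sequence indexed by $n = 0, 1, \ldots, N/4-1$.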

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. For $N = 2^v$, this decimation can be performed $v = \log_2 N$ times. Thus the total number of complex multiplications is reduced to $(N/2)\log_2 N$, and the number of complex additions is $N \log_2 N$.
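
To make the savings concrete, the operation counts quoted above can be tabulated for a given N. This is a small sketch of the arithmetic only; the helper names are illustrative assumptions, and the counts are the textbook figures from the derivation above (trivial twiddle factors are not discounted).

```python
def log2_exact(n):
    """Return v such that n == 2**v; raise if n is not a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def direct_dft_mults(n):
    """Complex multiplications for direct evaluation of the DFT: N^2."""
    return n * n


def after_first_split_mults(n):
    """Complex multiplications after one decimation step: 2(N/2)^2 + N/2."""
    return 2 * (n // 2) ** 2 + n // 2


def fft_mults(n):
    """Complex multiplications of the full radix-2 FFT: (N/2) log2 N."""
    return (n // 2) * log2_exact(n)


def fft_adds(n):
    """Complex additions of the full radix-2 FFT: N log2 N."""
    return n * log2_exact(n)
```

For N = 1024 these give 1048576, 524800, 5120 and 10240, respectively.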

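Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is also obtained by the divide-and-conquer approach. To derive the algorithm, we begin by splitting the DFT formula into two summations, one of which involves the sum over the first N/2 data points and the other the sum over the last N/2 data points. Thus we obtain

$$X(k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn} + \sum_{n=N/2}^{N-1} x(n)\, W_N^{kn} = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn} + W_N^{Nk/2} \sum_{n=0}^{N/2-1} x\!\left(n+\frac{N}{2}\right) W_N^{kn}.$$

Since $W_N^{kN/2} = (-1)^k$,

$$X(k) = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x\!\left(n+\frac{N}{2}\right) \right] W_N^{kn}.$$

Now, let us split (decimate) $X(k)$ into the even- and odd-numbered samples. Thus we obtain

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n+\frac{N}{2}\right) \right] W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, \frac{N}{2}-1,$$

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left\{ \left[ x(n) - x\!\left(n+\frac{N}{2}\right) \right] W_N^{n} \right\} W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, \frac{N}{2}-1,$$

where we have used the fact that $W_N^2 = W_{N/2}$.

The computational procedure above can be repeated through decimation of the N/2-point DFTs $X(2k)$ and $X(2k+1)$. The entire process involves $v = \log_2 N$ stages of decimation. Consequently, the computation of the N-point DFT via the decimation-in-frequency FFT requires $(N/2)\log_2 N$ complex multiplications and $N \log_2 N$ complex additions, just as in the decimation-in-time algorithm.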
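
The decimation-in-frequency split can be sketched in the same way: the sum branch feeds the even outputs $X(2k)$ without any multiplication, and the difference branch is multiplied by $W_N^n$ before feeding the odd outputs $X(2k+1)$. This is a minimal Python sketch with illustrative function names, not the author's implementation.

```python
import cmath


def dft_reference(x):
    """Direct O(N^2) evaluation of the DFT definition, used only as a check."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft_dif2(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 2:
        raise ValueError("length must be a power of two")
    half = n_pts // 2
    # Sum part: no twiddle multiplication is needed.
    sums = [x[n] + x[n + half] for n in range(half)]
    # Difference part: multiplied by the twiddle factor W_N^n.
    diffs = [
        (x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts)
        for n in range(half)
    ]
    out = [0j] * n_pts
    out[0::2] = fft_dif2(sums)   # X(2k)
    out[1::2] = fft_dif2(diffs)  # X(2k+1)
    return out
```

The sketch writes the results back in natural order through slicing; a hardware pipeline instead produces them in a permuted order and needs a reordering stage.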

B. FFT review of radix-4 :

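When the number of data points N in the DFT is a power of 4 (i.e., $N = 4^v$), we can, of course, always use a radix-2 algorithm for the computation. However, for this case it is computationally more efficient to employ a radix-4 FFT algorithm. For our purposes, let us derive the radix-4 decimation-in-frequency algorithm by breaking the N-point DFT formula into four smaller DFTs. We have

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}$$

$$= \sum_{n=0}^{N/4-1} x(n)\, W_N^{kn} + \sum_{n=N/4}^{N/2-1} x(n)\, W_N^{kn} + \sum_{n=N/2}^{3N/4-1} x(n)\, W_N^{kn} + \sum_{n=3N/4}^{N-1} x(n)\, W_N^{kn}$$

$$= \sum_{n=0}^{N/4-1} x(n)\, W_N^{kn} + W_N^{Nk/4} \sum_{n=0}^{N/4-1} x\!\left(n+\frac{N}{4}\right) W_N^{kn} + W_N^{Nk/2} \sum_{n=0}^{N/4-1} x\!\left(n+\frac{N}{2}\right) W_N^{kn} + W_N^{3Nk/4} \sum_{n=0}^{N/4-1} x\!\left(n+\frac{3N}{4}\right) W_N^{kn}.$$

From the definition of the twiddle factor, we have

$$W_N^{kN/4} = (-j)^k, \qquad W_N^{kN/2} = (-1)^k, \qquad W_N^{3kN/4} = (j)^k.$$

Thus

$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^k\, x\!\left(n+\frac{N}{4}\right) + (-1)^k\, x\!\left(n+\frac{N}{2}\right) + (j)^k\, x\!\left(n+\frac{3N}{4}\right) \right] W_N^{nk}.$$

This relation is not an N/4-point DFT because the twiddle factor depends on N and not on N/4. To convert it into an N/4-point DFT, we subdivide the DFT sequence into four N/4-point subsequences, $X(4k)$, $X(4k+1)$, $X(4k+2)$, and $X(4k+3)$, $k = 0, 1, \ldots, N/4-1$. Thus we obtain the radix-4 decimation-in-frequency DFT as

$$X(4k) = \sum_{n=0}^{N/4-1} \left[ x(n) + x\!\left(n+\frac{N}{4}\right) + x\!\left(n+\frac{N}{2}\right) + x\!\left(n+\frac{3N}{4}\right) \right] W_N^{0}\, W_{N/4}^{kn},$$

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left[ x(n) - j\,x\!\left(n+\frac{N}{4}\right) - x\!\left(n+\frac{N}{2}\right) + j\,x\!\left(n+\frac{3N}{4}\right) \right] W_N^{n}\, W_{N/4}^{kn},$$

$$X(4k+2) = \sum_{n=0}^{N/4-1} \left[ x(n) - x\!\left(n+\frac{N}{4}\right) + x\!\left(n+\frac{N}{2}\right) - x\!\left(n+\frac{3N}{4}\right) \right] W_N^{2n}\, W_{N/4}^{kn},$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left[ x(n) + j\,x\!\left(n+\frac{N}{4}\right) - x\!\left(n+\frac{N}{2}\right) - j\,x\!\left(n+\frac{3N}{4}\right) \right] W_N^{3n}\, W_{N/4}^{kn},$$

where we have used the property $W_N^{4kn} = W_{N/4}^{kn}$. Note that the input to each N/4-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated $v$ times, where $v = \log_4 N$.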
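
The four radix-4 decimation-in-frequency branches can be checked against the DFT definition with a short recursive sketch. This is a software illustration under the assumption that the length is a power of 4; the names `dft_reference` and `fft_dif4` are hypothetical, and the code does not model the pipelined hardware.

```python
import cmath


def dft_reference(x):
    """Direct O(N^2) evaluation of the DFT definition, used only as a check."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft_dif4(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 4:
        raise ValueError("length must be a power of 4")
    q = n_pts // 4
    branches = [[], [], [], []]
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        w = cmath.exp(-2j * cmath.pi * n / n_pts)  # W_N^n
        branches[0].append(a + b + c + d)                     # feeds X(4k), W_N^0
        branches[1].append((a - 1j * b - c + 1j * d) * w)     # feeds X(4k+1), W_N^n
        branches[2].append((a - b + c - d) * w ** 2)          # feeds X(4k+2), W_N^2n
        branches[3].append((a + 1j * b - c - 1j * d) * w ** 3)  # feeds X(4k+3), W_N^3n
    out = [0j] * n_pts
    for r in range(4):
        out[r::4] = fft_dif4(branches[r])  # N/4-point DFT of branch r
    return out
```

Only the multiplications by the twiddle factors $W_N^{n}$, $W_N^{2n}$, $W_N^{3n}$ are general complex multiplications; the factors ±1 and ±j inside the brackets reduce to sign changes and real/imaginary swaps.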

C. Pipelining of radix-4 FFT :

Assume that the input data enter the FFT circuit serially in a continuous flow. As the data shift from one stage to the next, a high hardware utilization of the adders and multipliers is needed.

When designing the FFT hardware, we therefore have to consider the data rate at every stage.

III. COMBINED SDC-SDF RADIX-4 PIPELINED FFT

For a single-input data stream, we propose an efficient combined SDC-SDF radix-4 pipelined FFT architecture, together with an SDC PE structure that can reduce the complex multipliers by 50%.

A. Proposed FFT architecture

The proposed FFT architecture consists of one pre-stage, N/4-1 SDC stages, one post-stage, one SDF stage, and a bit reverser. The pre-stage shuffles the complex input data into a new sequence that consists of each real part followed by the corresponding imaginary part. The post-stage shuffles the new sequence back into the complex format. SDC stage t (t = 1, 2, ..., N/4-1) contains an SDC PE, which can achieve 100% arithmetic resource utilization of both complex adders and complex multipliers. The last stage, the SDF stage, is identical to the radix-4 SDF stage and contains a complex adder and a complex subtractor. In the bit reverser, the data with an even index are written into memory in normal order and are then retrieved from memory in bit-reversed order, while the data with an odd index are written in bit-reversed order. Finally, the even data are retrieved in normal order. Thus the bit reverser requires only N/4 data buffers.
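
The index mapping that a bit reverser has to realize can be sketched as follows. The helper `digit_reverse` is a hypothetical name; with `base=2` it gives the bit-reversed order referred to above, and with `base=4` it gives the base-4 digit-reversed order in which a textbook radix-4 decimation-in-frequency flow graph delivers its results. The sketch shows only the permutation, not the reduced-memory read/write scheduling described in the text.

```python
def digit_reverse(index, num_digits, base=2):
    """Reverse the base-`base` digits of `index`, using `num_digits` digits."""
    result = 0
    for _ in range(num_digits):
        index, digit = divmod(index, base)  # peel off the least significant digit
        result = result * base + digit      # append it at the most significant end
    return result


def reversed_order(n_points, base=2):
    """Return the digit-reversed index sequence for n_points = base**num_digits."""
    num_digits = 0
    size = 1
    while size < n_points:
        size *= base
        num_digits += 1
    if size != n_points:
        raise ValueError("n_points must be a power of base")
    return [digit_reverse(i, num_digits, base) for i in range(n_points)]
```

For 16 points, `reversed_order(16, 2)` starts 0, 8, 4, 12 and `reversed_order(16, 4)` starts 0, 4, 8, 12.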

In the 16-point data flow, the complex input data at cycle m are $(m_r, m_i)$, where $m_r$ and $m_i$ (m = 0, 1, 2, ..., 15) represent the real and imaginary parts, respectively. We only include the pre-stage, SDC stages 1, 2, and 3, and the post-stage, since the SDF stage has the same sequence as the post-stage except for an 8-cycle delay, and the bit reverser, with an 8-cycle delay over the SDF stage, produces the normal output sequence.
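
The reordering performed by the pre-stage (each real part followed by the corresponding imaginary part) and undone by the post-stage can be illustrated with a short sketch. The function names are illustrative assumptions, and the sketch ignores timing; it only shows the sequence format.

```python
def pre_stage(samples):
    """Turn complex samples into a real stream: m_r followed by m_i for each m."""
    stream = []
    for z in samples:
        z = complex(z)
        stream.append(z.real)
        stream.append(z.imag)
    return stream


def post_stage(stream):
    """Rebuild complex samples from a stream of (real, imaginary) pairs."""
    if len(stream) % 2:
        raise ValueError("stream must contain an even number of words")
    return [complex(stream[i], stream[i + 1]) for i in range(0, len(stream), 2)]
```

Applying `post_stage` to the output of `pre_stage` returns the original complex sequence.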

B. Single-path delay commutator processing engine:

The SDC PE consists of a data commutator, a real add/sub unit, and an optimum complex multiplier unit. In order to minimize the arithmetic resource of the SDC PE, the most significant factor is to maximize the arithmetic resource utilization by reordering the data sequences of the above three units.

In stage t, the data commutator shuffles its input data (Node-A) to generate a new data sequence (Node-B) in which the paired data have an index difference of N/2^t, where t is the index of the stage. The new data sequence (Node-B) is critical to the real add/sub unit, which contains one real adder and one real subtractor.
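
The pairing implied by an index difference of N/2^t can be listed explicitly. The sketch below assumes the usual decimation-in-frequency grouping, in which stage t works on blocks of 2·N/2^t consecutive indices; the text only states the index difference, so the block structure and the name `commutator_pairs` are assumptions made for illustration.

```python
def commutator_pairs(n_points, stage):
    """List index pairs (i, i + N/2^t) that meet at the add/sub unit in stage t."""
    distance = n_points >> stage  # N / 2^t
    if stage < 1 or distance < 1:
        raise ValueError("stage must satisfy 1 <= t <= log2(N)")
    block = 2 * distance          # assumed block size for this stage
    pairs = []
    for start in range(0, n_points, block):
        for offset in range(distance):
            pairs.append((start + offset, start + offset + distance))
    return pairs
```

For N = 16, stage 1 pairs index 0 with 8, 1 with 9, and so on, while stage 2 pairs 0 with 4 and 8 with 12.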

For the optimum complex multiplier unit, the output data sequence (Node-E) should be the same as the input data sequence (Node-C). If so, its output sequence (Node-E), which is also the output sequence of SDC stage t, can directly become the input data sequence (Node-A) of SDC stage t+1.

C. Optimum complex multiplier unit:

It contains 4 multiplexers (M0, M4, M2, M3), a 3.0-word memory (G0, G1, G2, G3), 4 real multipliers, 2 real adders, and 2 real subtractors. The complex and real input data all travel along the same path. As the signals are sent from one stage to the next, the four signals of each radix-4 group are separated into two halves, each with a real part and an imaginary part: the real and imaginary parts of the first half are added, while those of the second half are subtracted. The two halves are then sent along the same path again and reach a buffer, which stores the multiple signals. After every four stages, as in orthogonal frequency division multiplexing, the data are sent to a shift register and pass through the pipeline. The complete data then enter the SDF stage, whose delay feedback path carries the data from the output of the stage back to its input. This process forms a systolic architecture that consists of processing elements.
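
As background for the resource count, a complex multiplication by a twiddle factor can be written with real operations only. The sketch below uses the standard form with four real multiplications, one real subtraction, and one real addition; it is an illustration of that arithmetic, not the multiplexer and memory scheduling of the unit described above, and how the unit allocates its remaining adders and subtractors is not modeled here.

```python
def complex_multiply_real_ops(a_re, a_im, w_re, w_im):
    """Compute (a_re + j a_im)(w_re + j w_im) using real operations only."""
    p0 = a_re * w_re  # real multiplication 1
    p1 = a_im * w_im  # real multiplication 2
    p2 = a_re * w_im  # real multiplication 3
    p3 = a_im * w_re  # real multiplication 4
    out_re = p0 - p1  # one real subtraction gives the real part
    out_im = p2 + p3  # one real addition gives the imaginary part
    return out_re, out_im
```

For example, multiplying 1 + 2j by 3 + 4j this way yields the real part -5 and the imaginary part 10.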

[Figure: Flow chart of the DIF FFT]

[Figure: Butterfly diagram]

[Figure: Hardware architecture]

DATA OUTPUT ORDER OF THE PROPOSED PIPELINED ARCHITECTURE FROM PRE-STAGE TO STAGE N/4-1 OF 16-POINT FFT

Cycles  Digital input  1st stage  2nd stage
0       0000           12r,0i     0r,0i
0       0001           12r,0i     0r,0i
0       0010           14r,0i     0r,0i
0       0011           14r,0i     0r,0i
0       0100           16r,0i     0r,0i
1       0101           16r,0i     0r,8.65i
2       0110           18r,0i     0r,0i
3       0111           18r,0i     0r,-9.738i
0       1000           -12r,0i    0r,0i
2       1001           -4r,0i     0r,0i
4       1010           -12r,0i    12r,0i
6       1011           -4r,0i     0r,0i
0       1100           -12r,0i    0r,0i
3       1101           -4r,0i     0r,2.164i
6       1110           -12r,0i    0r,0i
9       1111           -4r,0i     0r,-0.496i

DATA SEQUENCE FROM PRE-STAGE TO BIT REVERSER

Cycles  Digital input  1st stage  2nd stage * twiddle factor  Bit reverser
0       0000           12r,0i     0r,0i*0                     0
0       0001           12r,0i     0r,0i*0                     0
0       0010           14r,0i     0r,0i*0                     0
0       0011           14r,0i     0r,0i*0                     0
0       0100           16r,0i     0r,0i*0                     0
1       0101           16r,0i     0r,8.656i*0                 0
2       0110           18r,0i     0r,0i*0                     0
3       0111           18r,0i     0r,-9.738i*0                0
0       1000           -12r,0i    0r,0i*0                     0
2       1001           -4r,0i     0r,0i*0                     0
4       1010           -12r,0i    12r,0i*0                    0
6       1011           -4r,0i     0r,0i*0                     0
0       1100           -12r,0i    0r,0i*0                     0
3       1101           -4r,0i     0r,2.164i*0                 0
6       1110           -12r,0i    0r,0i*0                     0
9       1111           -4r,0i     0r,-0.496i*0                0

Hardware resource comparison for the various pipelined FFT architectures

Architecture  Internal memory  Overall memory  Adder  General multiplier (utilization)  Constant multiplier  Throughput  Latency    Critical path delay
R4 SDF        N/4-1            4N/4-1          …      … (50%)                           Nil                  4/N         4N-1+…     …+…
R4 SDC        4N/4-4           16N/4-4         …      … (50%)                           Nil                  4/N         N+…        …+…
Chang         3.0N             4N              …      … (50%)                           Nil                  4/N         4N+…       …+…
Liu           3.0N+4X          4N+4X           …      … (50%)                           Nil                  4/N         4N+4X+…    …+…
…             N/4-1            4N/4-1          …      … (75%)                           Nil                  4/N         4N-1+…     …+…
Proposed      3.0N+3.0X        3.0N+3.0X       …      … (100%)                          Nil                  4/N         4N+…+…     …+…

COMPARISON OF TRANSISTOR REQUIREMENT AND LATENCY

Architecture  Components                                                           Transistors    Latency  Transistors × Latency
Chang         1024 16-bit SRAMs, 32 16-bit adders, 28 16-bit multipliers           230748 (135%)  512      118142976 (133%)
Liu           1048 16-bit SRAMs, 32 16-bit adders, 28 16-bit multipliers           233052 (136%)  524      122119248 (138%)
…             1022 16-bit SRAMs, 32 16-bit adders, 12 16-bit multipliers           167138 (98%)   511      85407518 (96%)
…             1192 16-bit SRAMs, 22 16-bit adders, 12 16-bit multipliers           175378 (103%)  591      103648398 (117%)
R2^3 SDF      1022 16-bit SRAMs, 37.6 16-bit adders, 11.2 16-bit multipliers       163614 (96%)   511      83606754 (94%)
R2^4 SDF      1048 16-bit SRAMs, 35.6 16-bit adders, 7.2 16-bit multipliers        145992 (85%)   511      74601912 (84%)
Proposed      1045 16-bit SRAMs, 25 16-bit adders, 14 16-bit multipliers           171087 (100%)  519      88794153 (100%)
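
The last column of the comparison is the product of the transistor count and the latency, and the percentages are taken relative to the proposed design. The arithmetic can be reproduced with a short sketch; the row labels "row 3" and "row 4" are placeholders for the two entries whose architecture names are not legible in the table, and the helper names are illustrative.

```python
# (label, transistors, latency, transistors x latency) as listed in the table.
ROWS = [
    ("Chang", 230748, 512, 118142976),
    ("Liu", 233052, 524, 122119248),
    ("row 3", 167138, 511, 85407518),
    ("row 4", 175378, 591, 103648398),
    ("R2^3 SDF", 163614, 511, 83606754),
    ("R2^4 SDF", 145992, 511, 74601912),
    ("Proposed", 171087, 519, 88794153),
]


def products_match(rows):
    """Check that every listed product equals transistors times latency."""
    return all(t * lat == prod for _, t, lat, prod in rows)


def relative_percent(value, baseline):
    """Express value as a rounded percentage of baseline."""
    return round(100 * value / baseline)
```

With the proposed design as the baseline (171087 transistors, product 88794153), Chang's design comes out at 135% of the transistors and 133% of the product, matching the table.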

AREA AND PERFORMANCE OF THE PROPOSED FFT ARCHITECTURE FOR 16 BITS

FFT size  LUTs  FFs   DSPs  BRAMs  Freq (MHz)  Latency (ns)
16        672   522   4     0      322         140
64        1110  752   8     0      303         498
256       1733  1073  12    0      297         1834
1024      2804  1589  16    3      298         7028
4096      8391  2780  20    4      295         27975

ANALYSIS OF THE SIGNAL FLOW OF THE RADIX-4 DIF FFT

* Consider the signal flow from one stage to the next through the butterfly diagram.

* Add and subtract the real and imaginary parts in each group of four.

* When the twiddle factor is taken into account, it is a complex value.

* The real and imaginary parts obtained in every stage are multiplied by this complex value. The whole signal is then stored in a buffer, selected by the multiplexers, and passed through the pipelined structure by means of shift registers. Each channel contains a set of four signals; these signals pass to the SDF path, and every channel is then used for bit reversal.

CONCLUSION

We propose a combined SDC-SDF pipelined FFT architecture which produces the output data in the normal order. The proposed SDC PE reduces the complex multipliers by 50% compared with the other radix-4 DIF FFT designs. Therefore, the proposed FFT architecture is very attractive for single-path pipelined radix-4 FFT processors with the input and output sequences in normal order.

REFERENCES

[1] L. J. Cimini, "Analysis and simulation of a digital mobile channel using orthogonal frequency division multiplexing," IEEE Trans. Commun., vol. 33, no. 7, pp. 665-675, Jul. 1985.

[2] J. M. Cioffi, The Communications Handbook. Boca Raton, FL, USA: CRC Press, 1997.

[3] Y. W. Lin, H. Y. Liu, and C. Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, Aug. 2005.

[4] C. Cheng and K. K. Parhi, "High-throughput VLSI architecture for FFT computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 10, pp. 339-344, Oct. 2007.

[5] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 451-455, Jun. 2010.

[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1975, pp. 604-609.

[7] E. H. Wold and A. M. Despain, "Pipelined and parallel-pipelined FFT processors for VLSI implementations," IEEE Trans. Comput., vol. C-33, no. 5, pp. 414-426, May 1984.

[8] T. Sansaloni, A. Perez-Pascual, V. Torres, and J. Valls, "Efficient pipelined FFT processors for WLAN MIMO-OFDM systems," Electron. Lett., vol. 41, no. 19, pp. 1043-1044, Sep. 2005.

[9] A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. Comput., vol. C-23, no. 10, pp. 993-1001, Oct. 1974.

[10] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Boston, MA, USA: Addison-Wesley, 2005.