A Combined SDC-SDF Architecture For Normal I/O Pipelined Radix-4 FFT
S.MAGESHKUMAR,
Department of ECE,
Asan Memorial College Of Engineering And Technology,
Abstract
We present an efficient combined single-path delay commutator and single-path delay feedback (SDC-SDF) radix-4 pipelined fast Fourier transform (FFT) architecture, which includes SDC stages and one SDF stage. The SDC processing engine is proposed to achieve 100% hardware resource utilization by sharing the common arithmetic resources, including both adders and multipliers, in a time-multiplexed approach. The number of complex multipliers is thereby reduced compared with the other radix-4 SDC-SDF architectures. In addition, the proposed architecture requires roughly the minimum number of complex adders and a complex delay memory of 4N+3.0.
Index Terms: Fast Fourier transform (FFT), pipelined architecture, single-path delay commutator processing element (SDC PE).
Introduction
The fast Fourier transform (FFT) plays a significant role in digital signal processing, especially in advanced communication systems such as orthogonal frequency division multiplexing (OFDM) and asymmetric digital subscriber line (ADSL). All these systems require FFT computation with high throughput and low latency, so designing a high-performance FFT circuit is an efficient way to meet these requirements. In particular, pipelined FFT architectures have mainly been adopted because of their attractive properties, such as small chip area, high throughput, and low power consumption.
To the best of our knowledge, two types of pipelined FFT architecture can be found in the literature: delay feedback (DF) and delay commutator (DC). Further, according to the number of input data stream paths, they can be classified into multipath (M) and single-path (S) architectures. The two classifications form four kinds of pipelined FFT architecture: MDF, MDC, SDF, and SDC. Multipath architectures are often adopted when the throughput requirement is beyond the theoretical limit that a single-path architecture can offer at a given clock frequency. However, they require concurrent read (write) operations for the multipath input (output) data. Therefore, a single-path (S) architecture could be appropriate in cases where the system cannot ensure concurrent operations. Its arithmetic utilization, however, is relatively low compared with the 100% utilization of the existing MDF/MDC architectures. Existing work also achieves 100% multiplier utilization by reordering the inner data sequence.
For a single input data stream, the conventional radix-4 SDF FFT architecture requires complex adders and complex multipliers in numbers that depend on N, where N is the FFT size. Both Chang [11] and Liu et al. [12] present novel SDC architectures that reduce the complex adders by 50% by reordering the inner data sequence. However, the utilization of the corresponding complex multipliers still remains 50% for both architectures. We therefore study whether the complex multiplier unit can be modified to achieve 100% utilization.
In the radix-4 FFT architecture, a common observation is that one half of the data (the sum part of the butterfly operation) does not involve complex multiplication by a twiddle factor W at all, while the other half (the difference part) does involve complex multiplication by W. Hence, there is an opportunity to reduce the arithmetic resource of the conventional complex multiplier by a factor of 2, leading to 100% utilization. Ideally, every two consecutive complex input data contain one complex number that needs complex multiplication. If so, we can minimize the reordering memory requirement while achieving the above objective of reducing the arithmetic resource of the complex multipliers by 50%.
Fortunately, the improved SDC architecture can produce the sum and the corresponding difference results of a butterfly operation in two consecutive cycles. The sum part is passed directly to the next stage, while the difference part needs complex multiplication before passing to the next stage. Therefore, the SDC architecture is ideal for our efficient pipelined radix-4 FFT architecture. The SDF architecture, however, does not meet the above constraint well, since the sums of all butterflies in a stage are produced first, followed by the corresponding differences.
In this brief, we present an efficient combined SDC-SDF radix-4 pipelined architecture, which includes SDC stages, one SDF stage, and one bit reverser. The SDC processing engine (SDC PE) in each SDC stage achieves 100% hardware utilization of both adders and multipliers. We include the SDF stage to reorder the data sequence, so that the delay memory of the bit reverser is reduced to N/4. The proposed architecture produces its output in normal order, the same order as its input.
REVIEW OF PIPELINED FFT ARCHITECTURE
A. FFT review of radix-2:
Let us consider the computation of the N = 2^v point DFT by the divide-and-conquer approach. We split the N-point data sequence x(n) into two N/2-point data sequences f_1(n) and f_2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively; that is,
f_1(n) = x(2n)
f_2(n) = x(2n+1), n = 0, 1, ..., N/2-1
Thus f_1(n) and f_2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting FFT algorithm is called a decimation-in-time algorithm.
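Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as follows. The N-point DFT is defined by
X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, k = 0, 1, 2, ..., N-1,
where x(n) is the input data, W_N^{nk} is the twiddle coefficient (W_N^{nk} = e^{-j 2\pi nk/N}), and N is any integer power of two. Splitting the sum into even-numbered and odd-numbered samples gives
X(k) = \sum_{n even} x(n) W_N^{nk} + \sum_{n odd} x(n) W_N^{nk}
     = \sum_{m=0}^{N/2-1} x(2m) W_N^{2mk} + \sum_{m=0}^{N/2-1} x(2m+1) W_N^{(2m+1)k}
But W_N^2 = W_{N/2}. With this substitution, the equations can be expressed as
X(k) = \sum_{m=0}^{N/2-1} f_1(m) W_{N/2}^{km} + W_N^k \sum_{m=0}^{N/2-1} f_2(m) W_{N/2}^{km}
     = F_1(k) + W_N^k F_2(k), k = 0, 1, ..., N-1,
where F_1(k) and F_2(k) are the N/2-point DFTs of the sequences f_1(m) and f_2(m), respectively.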
Since F_1(k) and F_2(k) are periodic with period N/2, we have F_1(k+N/2) = F_1(k) and F_2(k+N/2) = F_2(k). In addition, the factor W_N^{k+N/2} = -W_N^k. Hence the equations may be expressed as
X(k) = F_1(k) + W_N^k F_2(k), k = 0, 1, ..., N/2-1
X(k+N/2) = F_1(k) - W_N^k F_2(k), k = 0, 1, ..., N/2-1
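The pair of equations above can be checked numerically. The following is a minimal Python sketch, not the hardware described in this brief: it combines two N/2-point DFTs of the even-numbered and odd-numbered samples and compares the result with a direct DFT. The function names `dft` and `dit_step` and the test signal are illustrative assumptions.

```python
import cmath

def dft(x):
    """Direct N-point DFT with O(N^2) operations, used as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def dit_step(x):
    """One decimation-in-time step: X(k) = F1(k) + W_N^k F2(k), X(k+N/2) = F1(k) - W_N^k F2(k)."""
    n_pts = len(x)
    half = n_pts // 2
    f1 = dft(x[0::2])  # N/2-point DFT of the even-numbered samples
    f2 = dft(x[1::2])  # N/2-point DFT of the odd-numbered samples
    out = [0j] * n_pts
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n_pts) * f2[k]  # W_N^k * F2(k)
        out[k] = f1[k] + t
        out[k + half] = f1[k] - t
    return out

# Hypothetical 8-point test signal
x = [complex(n, 0) for n in range(8)]
max_err = max(abs(a - b) for a, b in zip(dit_step(x), dft(x)))
```

For this input `max_err` is at the level of floating-point rounding, so the two-half combination reproduces the full DFT.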
We observe that the direct computation of F_1(k) requires (N/2)^2 complex multiplications. The same applies to the computation of F_2(k). Furthermore, there are N/2 additional complex multiplications required to compute W_N^k F_2(k). Hence the computation of X(k) requires 2(N/2)^2 + N/2 = N^2/2 + N/2 complex multiplications. This first step reduces the number of multiplications from N^2 to N^2/2 + N/2, which is about a factor of 2 for large N.
By decimating f_1(n) and f_2(n) in the same way and computing N/4-point DFTs, we would obtain the N/2-point DFTs F_1(k) and F_2(k) from the relations
F_1(k) = F{f_1(2n)} + W_{N/2}^k F{f_1(2n+1)}
F_1(k+N/4) = F{f_1(2n)} - W_{N/2}^k F{f_1(2n+1)}
F_2(k) = F{f_2(2n)} + W_{N/2}^k F{f_2(2n+1)}
F_2(k+N/4) = F{f_2(2n)} - W_{N/2}^k F{f_2(2n+1)}
for k = 0, 1, ..., N/4-1 and n = 0, 1, ..., N/4-1, where F{.} denotes the N/4-point DFT of the indicated sequence, evaluated at index k.
The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. For N = 2^v, this decimation can be performed v = \log_2 N times. Thus the total number of complex multiplications is reduced to (N/2) \log_2 N, and the number of complex additions is N \log_2 N.
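To put numbers on these operation counts, the short Python sketch below evaluates the three expressions from the text (N^2 for the direct DFT, N^2/2 + N/2 after one decimation step, and (N/2) log2 N for the full radix-2 FFT). The function names and the choice N = 1024 are illustrative assumptions.

```python
def direct_mults(n_pts):
    """Complex multiplications of the direct DFT: N^2."""
    return n_pts * n_pts

def one_step_mults(n_pts):
    """Complex multiplications after one decimation step: 2(N/2)^2 + N/2."""
    return 2 * (n_pts // 2) ** 2 + n_pts // 2

def fft_mults(n_pts):
    """Complex multiplications of the full radix-2 FFT: (N/2) log2 N."""
    stages = n_pts.bit_length() - 1  # log2 N for N a power of two
    return (n_pts // 2) * stages

def fft_adds(n_pts):
    """Complex additions of the full radix-2 FFT: N log2 N."""
    stages = n_pts.bit_length() - 1
    return n_pts * stages

n_pts = 1024  # hypothetical FFT size, must be a power of two
counts = (direct_mults(n_pts), one_step_mults(n_pts), fft_mults(n_pts), fft_adds(n_pts))
```

For N = 1024 the counts are 1048576, 524800, 5120, and 10240, which shows how small the remaining multiplier workload is once the decimation is carried through all stages.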
Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is also obtained by using the divide-and-conquer approach. To derive the algorithm, we begin by splitting the DFT formula into two summations, one of which involves the sum over the first N/2 data points and the other the sum over the last N/2 data points. Thus we obtain
X(k) = \sum_{n=0}^{N/2-1} x(n) W_N^{kn} + \sum_{n=N/2}^{N-1} x(n) W_N^{kn}
     = \sum_{n=0}^{N/2-1} x(n) W_N^{kn} + W_N^{Nk/2} \sum_{n=0}^{N/2-1} x(n+N/2) W_N^{kn}
Since W_N^{Nk/2} = (-1)^k,
X(k) = \sum_{n=0}^{N/2-1} [x(n) + (-1)^k x(n+N/2)] W_N^{kn}
Now, let us split (decimate) X(k) into the even- and odd-numbered samples. Thus we obtain
X(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(n+N/2)] W_{N/2}^{kn}, k = 0, 1, ..., N/2-1
X(2k+1) = \sum_{n=0}^{N/2-1} {[x(n) - x(n+N/2)] W_N^n} W_{N/2}^{kn}, k = 0, 1, ..., N/2-1
where we have used the fact that W_N^2 = W_{N/2}.
The computational procedure above can be repeated through decimation of the N/2-point DFTs X(2k) and X(2k+1). The entire process involves v = \log_2 N stages of decimation. Consequently, the computation of the N-point DFT via the decimation-in-frequency FFT requires (N/2) \log_2 N complex multiplications and N \log_2 N complex additions, just as in the decimation-in-time algorithm.
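As an illustration of the decimation-in-frequency split, the following Python sketch forms the sum sequence x(n) + x(n+N/2) and the twiddled difference sequence [x(n) - x(n+N/2)] W_N^n, takes an N/2-point DFT of each, and interleaves the results into the even and odd outputs. It is a software check under stated assumptions, with hypothetical names `dft` and `dif_step`, and does not model the pipelined hardware.

```python
import cmath

def dft(x):
    """Direct N-point DFT with O(N^2) operations, used as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def dif_step(x):
    """One decimation-in-frequency step producing X(2k) and X(2k+1)."""
    n_pts = len(x)
    half = n_pts // 2
    # Sum part: needs no twiddle multiplication
    sums = [x[n] + x[n + half] for n in range(half)]
    # Difference part: multiplied by the twiddle factor W_N^n
    diffs = [
        (x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts)
        for n in range(half)
    ]
    out = [0j] * n_pts
    out[0::2] = dft(sums)   # even-numbered outputs X(2k)
    out[1::2] = dft(diffs)  # odd-numbered outputs X(2k+1)
    return out

# Hypothetical 8-point test signal
x = [complex(n, -n) for n in range(8)]
max_err = max(abs(a - b) for a, b in zip(dif_step(x), dft(x)))
```

Only the difference half passes through a twiddle multiplication, which is the observation the multiplier-sharing argument in the Introduction relies on.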
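B. FFT review of radix-4:
When the number of data points N in the DFT is a power of 4 (i.e., N = 4^v), we can, of course, always use a radix-2 algorithm for the computation. However, for this case it is more efficient computationally to employ a radix-4 FFT algorithm. For our purposes, let us derive the radix-4 decimation-in-frequency algorithm by breaking the N-point DFT formula into four smaller DFTs. We have
X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}
     = \sum_{n=0}^{N/4-1} x(n) W_N^{kn} + \sum_{n=N/4}^{N/2-1} x(n) W_N^{kn} + \sum_{n=N/2}^{3N/4-1} x(n) W_N^{kn} + \sum_{n=3N/4}^{N-1} x(n) W_N^{kn}
     = \sum_{n=0}^{N/4-1} x(n) W_N^{kn} + W_N^{Nk/4} \sum_{n=0}^{N/4-1} x(n+N/4) W_N^{kn} + W_N^{Nk/2} \sum_{n=0}^{N/4-1} x(n+N/2) W_N^{kn} + W_N^{3Nk/4} \sum_{n=0}^{N/4-1} x(n+3N/4) W_N^{kn}
From the definition of the twiddle factor, we have
W_N^{Nk/4} = (-j)^k, W_N^{Nk/2} = (-1)^k, W_N^{3Nk/4} = (j)^k
Thus
X(k) = \sum_{n=0}^{N/4-1} [x(n) + (-j)^k x(n+N/4) + (-1)^k x(n+N/2) + (j)^k x(n+3N/4)] W_N^{nk}
This relation is not an N/4-point DFT because the twiddle factor depends on N and not on N/4. To convert it into an N/4-point DFT, we subdivide the DFT sequence into four N/4-point subsequences, X(4k), X(4k+1), X(4k+2), and X(4k+3), k = 0, 1, 2, ..., N/4-1. Thus we obtain the radix-4 decimation-in-frequency DFT as
X(4k) = \sum_{n=0}^{N/4-1} [x(n) + x(n+N/4) + x(n+N/2) + x(n+3N/4)] W_N^0 W_{N/4}^{kn}
X(4k+1) = \sum_{n=0}^{N/4-1} [x(n) - j x(n+N/4) - x(n+N/2) + j x(n+3N/4)] W_N^n W_{N/4}^{kn}
X(4k+2) = \sum_{n=0}^{N/4-1} [x(n) - x(n+N/4) + x(n+N/2) - x(n+3N/4)] W_N^{2n} W_{N/4}^{kn}
X(4k+3) = \sum_{n=0}^{N/4-1} [x(n) + j x(n+N/4) - x(n+N/2) - j x(n+3N/4)] W_N^{3n} W_{N/4}^{kn}
where we have used the property W_N^{4kn} = W_{N/4}^{kn}. Note that the input to each N/4-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated v times, where v = \log_4 N.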
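The radix-4 decimation-in-frequency relations can be exercised with a short recursive Python sketch. It applies the four linear combinations with coefficients (-j)^r, (-1)^r, and (j)^r, multiplies branch r by W_N^{rn}, and recurses on each N/4-point branch. The function names and the 16-point test input are assumptions made for illustration; the sketch is a software model of the algorithm, not of the proposed pipeline.

```python
import cmath

def dft(x):
    """Direct N-point DFT with O(N^2) operations, used as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT; len(x) must be a power of 4."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    q = n_pts // 4
    branches = [[], [], [], []]
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        # Radix-4 butterfly: one linear combination per output residue r
        combos = (
            a + b + c + d,            # r = 0 -> X(4k)
            a - 1j * b - c + 1j * d,  # r = 1 -> X(4k+1)
            a - b + c - d,            # r = 2 -> X(4k+2)
            a + 1j * b - c - 1j * d,  # r = 3 -> X(4k+3)
        )
        for r in range(4):
            twiddle = cmath.exp(-2j * cmath.pi * r * n / n_pts)  # W_N^{rn}
            branches[r].append(combos[r] * twiddle)
    out = [0j] * n_pts
    for r in range(4):
        out[r::4] = fft_radix4_dif(branches[r])  # N/4-point DFT of branch r
    return out

# Hypothetical 16-point test signal
x = [complex(n % 5, n % 3) for n in range(16)]
max_err = max(abs(a - b) for a, b in zip(fft_radix4_dif(x), dft(x)))
```

Because the outputs are written directly to indices 4k + r, this recursive version returns X(k) in normal order; a pipelined implementation that processes the data in place would instead need a reordering stage.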
C. Pipelining of radix-4 FFT:
Assume that the input data enter the FFT circuit serially in a continuous flow. As the data shift from one stage to the next, each stage should keep its adders and multipliers highly utilized. When designing the FFT hardware, we therefore consider the data rate at every stage.
III. COMBINED SDC-SDF RADIX-4 PIPELINED FFT
For a single input data stream, we propose an efficient combined SDC-SDF radix-4 pipelined FFT architecture. The proposed SDC PE structure can reduce the complex multipliers by 50%.
A. Proposed FFT architecture:
The proposed FFT architecture consists of one pre-stage, N/4-1 SDC stages, one post-stage, one SDF stage, and one bit reverser. The pre-stage shuffles the complex input data into a new sequence that consists of each real part followed by the corresponding imaginary part. The corresponding post-stage shuffles the new sequence back to the complex format. SDC stage t (t = 1, 2, ..., N/4-1) contains an SDC PE, which can achieve 100% arithmetic resource utilization of both complex adders and complex multipliers. The last stage, the SDF stage, is identical to the radix-4 SDF stage and contains a complex adder and a complex subtractor. In the bit reverser, the data with an even index are written into memory in normal order and are then retrieved from memory in bit-reversed order, while the data with an odd index are written in bit-reversed order. Finally, the odd data are retrieved in normal order. Thus, the bit reverser requires only N/4 data buffers.
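For readers unfamiliar with bit-reversed ordering, the Python sketch below generates the bit-reversal permutation that such a reverser undoes. It illustrates the index mapping only and makes no attempt to model the shared N/4-word memory scheme described above; the function names are hypothetical, and the 16-point size is chosen to match the example in this section.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit to the result
        index >>= 1
    return result

def bit_reversed_order(n_pts):
    """Bit-reversed index sequence for n_pts points (n_pts a power of two)."""
    bits = n_pts.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_pts)]

order = bit_reversed_order(16)
# Applying the permutation twice restores normal order
restored = [order[i] for i in order]
```

For 16 points, `order` is [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]. Because bit reversal is its own inverse, reading in bit-reversed order what was written in normal order (or the other way round) yields the same reordering, which is what allows the two halves of the data to share one buffer.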
In the 16-point example, the complex input data at cycle m are (m_r, m_i), where m_r and m_i (m = 0, 1, 2, ..., 15) represent the real and imaginary parts, respectively. We only include the pre-stage, SDC stages 1, 2, and 3, and the post-stage, since the SDF stage has the same sequence as the post-stage except for an 8-cycle delay, and the bit reverser, with an 8-cycle delay over the SDF stage, produces the normal output sequence.
B. Single-path DC processing engine:
The SDC PE consists of a data commutator, a real add/sub unit, and an optimum complex multiplier unit. To minimize the arithmetic resource of the SDC PE, the most significant factor is to maximize the arithmetic resource utilization by reordering the data sequences of the above three units.
In stage t, the data commutator shuffles its input data (Node-A) to generate a new data sequence (Node-B) whose index difference is N/2^t, where t is the index of the stage. The new data sequence (Node-B) is critical to the real add/sub unit, which contains one real adder and one real subtractor.
For the optimum complex multiplier unit, the output data sequence (Node-E) should be the same as its input data sequence (Node-C). If so, its output sequence (Node-E), which is also the output sequence of SDC stage t, can directly become the input data sequence (Node-A) of SDC stage t+1.
C. Optimum complex multiplier unit:
The unit contains four multiplexers (M0, M4, M2, M3), a 3.0-word memory (G0, G1, G2, G3), four real multipliers, two real adders, and two real subtractors. The input signals (complex and real data) travel along the same path. As the signals are sent from one stage to the next, each radix-4 group is spread into four signals, which are further separated into two halves of real and imaginary parts: the real and imaginary parts of the first half are added, and those of the second half are subtracted. The two half signals are again sent along the same path to a buffer, which stores the multiple signals. After every four stages, as in orthogonal frequency division multiplexing, the data are sent to a shift register and move through the pipeline. The whole data stream then occupies the SDF stage. The delay feedback path transfers the data from input to output and feeds data from the output stage back to the input stage. This process resembles a systolic architecture built from processing elements.
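As background for the real multipliers in this unit, the following Python sketch shows the textbook decomposition of one complex multiplication into four real multiplications, one real subtraction, and one real addition. It is an assumption-labeled illustration of the arithmetic only: the multiplexers, the memory words, and the time-multiplexed scheduling of the source unit are not modeled, and the function name is hypothetical.

```python
def complex_multiply(a_re, a_im, w_re, w_im):
    """Multiply (a_re + j a_im) by (w_re + j w_im) using real operations only."""
    # Four real multiplications
    p0 = a_re * w_re
    p1 = a_im * w_im
    p2 = a_re * w_im
    p3 = a_im * w_re
    # One real subtraction for the real part, one real addition for the imaginary part
    return p0 - p1, p2 + p3

# Worked example: (1 + 2j) * (3 + 4j) = -5 + 10j
product = complex_multiply(1.0, 2.0, 3.0, 4.0)
```

In a time-multiplexed design, a data stream in which only every second sample needs this product leaves the multipliers idle half the time unless they are shared, which is the motivation for the reordering discussed earlier.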
FLOW CHART IN DIF FFT
BUTTERFLY DIAGRAM
HARDWARE ARCHITECTURE
DATA OUTPUT ORDER OF THE PROPOSED PIPELINED ARCHITECTURE FROM PRE-STAGE TO STAGE N/4-1 OF 16-POINT FFT
CYCLES  DIGITAL INPUT  1ST STAGE  2ND STAGE
0       0000           12r,0i     0r,0i
0       0001           12r,0i     0r,0i
0       0010           14r,0i     0r,0i
0       0011           14r,0i     0r,0i
0       0100           16r,0i     0r,0i
1       0101           16r,0i     0r,8.65i
2       0110           18r,0i     0r,0i
3       0111           18r,0i     0r,-9.738i
0       1000           -12r,0i    0r,0i
2       1001           -4r,0i     0r,0i
4       1010           -12r,0i    12r,0i
6       1011           -4r,0i     0r,0i
0       1100           -12r,0i    0r,0i
3       1101           -4r,0i     0r,2.164i
6       1110           -12r,0i    0r,0i
9       1111           -4r,0i     0r,-0.496i
DATA SEQUENCE FROM PRE-STAGE TO BIT REVERSER
CYCLES  DIGITAL INPUT  1ST STAGE  2ND STAGE * TWIDDLE FACTOR  BIT REVERSER
0       0000           12r,0i     0r,0i*0                     0
0       0001           12r,0i     0r,0i*0                     0
0       0010           14r,0i     0r,0i*0                     0
0       0011           14r,0i     0r,0i*0                     0
0       0100           16r,0i     0r,0i*0                     0
1       0101           16r,0i     0r,8.656i*0                 0
2       0110           18r,0i     0r,0i*0                     0
3       0111           18r,0i     0r,-9.738i*0                0
0       1000           -12r,0i    0r,0i*0                     0
2       1001           -4r,0i     0r,0i*0                     0
4       1010           -12r,0i    12r,0i*0                    0
6       1011           -4r,0i     0r,0i*0                     0
0       1100           -12r,0i    0r,0i*0                     0
3       1101           -4r,0i     0r,2.164i*0                 0
6       1110           -12r,0i    0r,0i*0                     0
9       1111           -4r,0i     0r,-0.496i*0                0
Hardware resource comparison for the various pipelined FFT architectures
ARCHITECTURE  INTERNAL MEMORY  OVERALL MEMORY  ADDER  GENERAL MULTIPLIER (UTILIZATION)  CONSTANT MULTIPLIER  THROUGHPUT  LATENCY  CRITICAL PATH DELAY
R4 SDF    N/4-1      4N/4-1     (50%)   NIL  4/N  4N-1+ +
R4 SDC    4N/4-4     16N/4-4    (50%)   NIL  4/N  N+ +
CHANG     3.0N       4N         (50%)   NIL  4/N  4N+ +
LIU       3.0N + 4X  4N+4X      (50%)   NIL  4/N  4N+4X+ +
          N/4-1      4N/4-1     (75%)   NIL  4/N  4N-1+ +
PROPOSED  3.0N+3.0X  3.0N+3.0X  (100%)  NIL  4/N  4N++ +
COMPARISONS OF TRANSISTOR REQUIREMENT AND LATENCY
ARCHITECTURE  COMPONENTS                                                         TRANSISTORS    LATENCY  TRANSISTORS x LATENCY
CHANG         1024 16-bit SRAMs, 32 16-bit adders, 28 16-bit multipliers         230748 (135%)  512      118142976 (133%)
LIU           1048 16-bit SRAMs, 32 16-bit adders, 28 16-bit multipliers         233052 (136%)  524      122119248 (138%)
-             1022 16-bit SRAMs, 32 16-bit adders, 12 16-bit multipliers         167138 (98%)   511      85407518 (96%)
-             1192 16-bit SRAMs, 22 16-bit adders, 12 16-bit multipliers         175378 (103%)  591      103648398 (117%)
R2^3 SDF      1022 16-bit SRAMs, 37.6 16-bit adders, 11.2 16-bit multipliers     163614 (96%)   511      83606754 (94%)
R2^4 SDF      1048 16-bit SRAMs, 35.6 16-bit adders, 7.2 16-bit multipliers      145992 (85%)   511      74601912 (84%)
PROPOSED      1045 16-bit SRAMs, 25 16-bit adders, 14 16-bit multipliers         171087 (100%)  519      88794153 (100%)
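The last numeric column of this comparison appears to be the product of the transistor count and the latency. The Python sketch below checks that reading against the figures given in the table; the row labels for the two entries whose architecture names are missing are placeholders, not names from the source.

```python
# (label, transistors, latency, reported transistors x latency), copied from the table
rows = [
    ("CHANG", 230748, 512, 118142976),
    ("LIU", 233052, 524, 122119248),
    ("unnamed row 1", 167138, 511, 85407518),   # architecture name missing in the source
    ("unnamed row 2", 175378, 591, 103648398),  # architecture name missing in the source
    ("R2^3 SDF", 163614, 511, 83606754),
    ("R2^4 SDF", 145992, 511, 74601912),
    ("PROPOSED", 171087, 519, 88794153),
]

# Collect any row whose reported product differs from transistors * latency
mismatches = [
    label
    for label, transistors, latency, reported in rows
    if transistors * latency != reported
]
```

`mismatches` comes out empty, so every reported figure equals transistors times latency, and the percentages in parentheses can be read as ratios relative to the PROPOSED row.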
AREA AND PERFORMANCE OF THE PROPOSED FFT ARCHITECTURE FOR 16 BITS
FFT SIZE  LUTs  FFs   DSPs  BRAMs  FREQ (MHz)  LATENCY (ns)
16        672   522   4     0      322         140
64        1110  752   8     0      303         498
256       1733  1073  12    0      297         1834
1024      2804  1589  16    3      298         7028
4096      8391  2780  20    4      295         27975
ANALYSING SIGNAL FLOW OF RADIX-4 DIF FFT
* Consider the signal flow from one stage to the next via the butterfly diagram.
* Add and subtract the real and imaginary parts of each group of four signals.
* When the twiddle factor is taken into account, it is a complex value.
* The real and imaginary parts obtained in every stage are multiplied by this complex value. The whole signal can then be stored in a buffer, selected through multiplexers, and passed through the pipelined structure. A shift register shifts each channel; every channel contains four sets of signals, which pass to the delay feedback path, and the data of every channel are finally reordered by the bit reverser.
CONCLUSION
We propose a combined SDC-SDF pipelined FFT architecture that produces the output data in normal order. The proposed SDC PE reduces the complex multipliers by 50% compared with the other radix-4 DIF FFT designs. Therefore, the proposed FFT architecture is very attractive for single-path pipelined radix-4 FFT processors with the input and output sequences in normal order.
REFERENCES
[1] L. J. Cimini, "Analysis and simulation of a digital mobile channel using orthogonal frequency division multiplexing," IEEE Trans. Commun., vol. 33, no. 7, pp. 665-675, Jul. 1985.
[2] J. M. Cioffi, The Communications Handbook. Boca Raton, FL, USA: CRC Press, 1997.
[3] Y. W. Lin, H. Y. Liu, and C. Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, Aug. 2005.
[4] C. Cheng and K. K. Parhi, "High throughput VLSI architecture for FFT computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 10, pp. 339-344, Oct. 2007.
[5] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 451-455, Jun. 2010.
[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1975, pp. 604-609.
[7] E. H. Wold and A. M. Despain, "Pipelined and parallel-pipelined FFT processor for VLSI implementation," IEEE Trans. Comput., vol. C-33, no. 5, pp. 414-426, May 1984.
[8] T. Sansaloni, A. Perez-Pascual, V. Torres, and J. Valls, "Efficient pipelined processor for WLAN MIMO-OFDM systems," Electron. Lett., vol. 41, no. 19, pp. 1043-1044, Sep. 2005.
[9] A. M. Despain, "Fourier transform computer using CORDIC iterations," IEEE Trans. Comput., vol. C-23, no. 10, pp. 993-1001, Oct. 1974.
[10] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuit and Systems Perspective. Boston, MA, USA: Addison-Wesley, 2005.