7/25/2019 IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM11
1/4
IMPLEMENTATION OF CHANNEL DEMODULATOR
FOR DAB SYSTEM
Chien-Ming Wu, Ming-Der Shieh, Hsin-Fu Lo, and Min-Hsiung Hu
Graduate School of Engineering Science and Technology, National Yunlin University of Science and Technology, Taiwan
Department of Electronic Engineering, National Yunlin University of Science and Technology, Taiwan
Division of Design Service, National Science Council Chip Implementation Center (CIC), Taiwan
ABSTRACT
This paper describes the VLSI implementation of the Fast Fourier Transform (FFT) for the Eureka-147 Digital Audio Broadcasting (DAB) system. We emphasize how to minimize the hardware requirement and efficiently manage the memory to meet the DAB requirement. Implementation results demonstrate the applicability of our work, which features a modular design, consumes less silicon area, and facilitates extension to high transmission rate applications. The core size of the resulting chip implementation is 2086 × 1806 μm² based on the TSMC 0.35 μm 1P4M CMOS process. Performance evaluation reveals that our design for the targeted channel demodulator outperforms previous solutions.
1. INTRODUCTION
The Digital Audio Broadcasting (DAB) system, described in the European Eureka-147 standard [1], offers high-quality audio services, supports multimedia data delivery to mobile receivers, and might replace the traditional radio system. Basically, two strategies are employed to implement the DAB receiver: the DSP-based architecture [2, 3] and the ASIC-based implementation [4, 5]. The former offers maximum flexibility, ease of use, and simple programming, but it can only provide limited processing capability. In contrast, the ASIC-based implementation has the potential to support real-time symbol decoding at low cost.
Figure 1 shows an overview of the DAB system, in which ISO/MPEG coding is adopted for source coding and COFDM (Coded Orthogonal Frequency Division Multiplexing) for channel coding and modulation [1]. After convolutional coding, the generated codewords are interleaved in frequency for the fast information channel and in both time and frequency for the main service channel, and then the OFDM modulation is performed.
In this paper, we focus on the design and implementation of the channel demodulator, which essentially performs a Fast Fourier Transform (FFT). In general, two basic types of FFT architectures can be found in the literature: the pipelined architecture, with each stage consisting of a butterfly unit [6, 7], and the single butterfly architecture [5, 8], which employs just one radix-r butterfly unit. The main concern is the trade-off between hardware overhead and speed requirement.
Although the pipelined architecture can provide a higher throughput rate than the single butterfly implementation, we are still interested in the single butterfly architecture because of the specifications of the channel demodulator as well as the hardware considerations in the implementation of DAB receivers. For the single butterfly implementation, a basic problem is how to efficiently manage memory read/write accesses in order to increase its throughput rate. The common solutions include: (1) Use a high-radix implementation to reduce the total number of memory accesses at the expense of increasing the arithmetic complexity, i.e., the hardware requirement of a high-radix butterfly unit. (2) Partition the memory into several banks in order to allow concurrent accesses of multiple data with a more complicated addressing scheme, which might correspond to a larger routing area.
In this paper, we describe the design and implementation of the FFT for the DAB channel demodulator. We share our experience of using the conflict-free memory addressing arrangement in [9] to minimize the hardware requirement and to match the DAB requirement. Implementation results demonstrate the applicability of our work to the targeted channel demodulator and its advantages over previous solutions [5, 7] in terms of hardware requirement. The rest of this paper is organized as follows: Section 2 reviews the background and our previous work [9] related to this paper. Section 3 describes the resulting architecture and design of the FFT processor. Then, the corresponding chip implementation and performance evaluation are shown in Section 4. Finally, Section 5 concludes this work.
Figure 1. An overview of the DAB system [5].
2. PRELIMINARY RESULTS
The N-point Discrete Fourier Transform (DFT) of a sequence x(k) is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk} \qquad (1)$$

where $n = 0, 1, \ldots, N-1$ and $W = e^{-j2\pi/N}$. From Eq. (1), we know that $N^2$ multiplications and $N(N-1)$ additions are needed to directly perform the required computations. By applying the FFT, the computational complexity can be reduced to $O(N \log N)$.
If the number of sampled points is a power of the radix r, then it is easy to compute the DFT by using a radix-r FFT algorithm. In such a case, the N-point DFT can be decomposed into a set of recursively related r-point transforms. Decimation in time (DIT) and decimation in frequency (DIF) are the two basic classes of FFT algorithms [10]. Specifically, the DIT FFT algorithm is based on decomposing the input sequence x(k) into successively smaller subsequences. The DIF FFT algorithm decomposes the output sequence X(n) into smaller subsequences in the same way. Figure 2 shows a DIT 8-point FFT algorithm, in which the data in each stage can be processed based on the so-called butterfly units.
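The DIT decomposition described above can be illustrated with a minimal Python sketch. It is not the hardware algorithm of this paper, only a textbook recursive radix-2 DIT FFT checked against a direct evaluation of Eq. (1); the function names `dft` and `fft_dit` are chosen here for illustration.

```python
import cmath

def dft(x):
    """Direct evaluation of Eq. (1): X(n) = sum_k x(k) * W**(n*k)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(x[k] * w ** (n * k) for k in range(n_points))
            for n in range(n_points)]

def fft_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    even = fft_dit(x[0::2])   # transform of the even-indexed subsequence
    odd = fft_dit(x[1::2])    # transform of the odd-indexed subsequence
    out = [0j] * n_points
    for n in range(n_points // 2):
        # One radix-2 butterfly: combine even[n] and odd[n] with a twiddle.
        t = cmath.exp(-2j * cmath.pi * n / n_points) * odd[n]
        out[n] = even[n] + t
        out[n + n_points // 2] = even[n] - t
    return out
```

Each level of the recursion corresponds to one stage of butterflies in the data flow graph, and the loop body is a single butterfly operation.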
0-7803-7761-3/03/$17.00 ©2003 IEEE
Figure 2. The data flow graph of the DIT 8-point FFT computation.
In general, an N-point FFT computation requires $(N/r) \times \log_r N$ radix-r butterfly computations, and either the pipelined architecture or the single butterfly architecture can be selected for a dedicated application. For the single butterfly implementation, this implies $2N \times \log_r N$ memory accesses, which are the main bottleneck for fast FFT computation. Therefore, we need an efficient memory management strategy to overcome this problem, i.e., to reduce the number of memory accesses or to increase the memory bandwidth.
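To give a feel for these counts, the short Python sketch below evaluates both expressions. The helper names are illustrative, and the stage count is rounded up to the next integer when N is not an exact power of r, which is an assumption made here for convenience.

```python
def num_stages(n_points, radix):
    """Smallest m with radix**m >= n_points (integer arithmetic only)."""
    m, span = 0, 1
    while span < n_points:
        span *= radix
        m += 1
    return m

def butterfly_count(n_points, radix):
    """(N / r) * log_r(N) radix-r butterfly computations."""
    return (n_points // radix) * num_stages(n_points, radix)

def memory_accesses(n_points, radix):
    """2 * N * log_r(N): every datum is read once and written once per stage."""
    return 2 * n_points * num_stages(n_points, radix)

if __name__ == "__main__":
    # 2048-point radix-2 FFT: 11 stages, 11264 butterflies, 45056 accesses.
    print(num_stages(2048, 2), butterfly_count(2048, 2), memory_accesses(2048, 2))
```

For a 2048-point radix-2 transform this gives 11264 butterflies and 45056 memory accesses, which shows why the memory traffic rather than the arithmetic dominates a single butterfly design.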
In our previous work [9], we presented a set of simple but efficient equations to partition the memory into a number of memory banks such that the equivalent memory bandwidth can be increased with simple interconnection networks. Let m be the number of stages for the FFT computation; its value can be computed by

$$m = \lceil \log_r N \rceil \qquad (2)$$
Following the notation of the conventional number system, it is assumed that the original memory address $A_d$ is expressed in unsigned radix-r representation defined as

$$A_d = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1, d_0)_r \qquad (3)$$

where $d_i$ is an integer and $0 \le d_i \le r-1$. Consequently, a feasible solution to partition the memory into r banks can be easily obtained as shown in Eq. (4), which implies that the original address $A_d$ will be distributed into the bank number $B(A_d, r)$.

$$B(A_d, r) = (d_{m-1} + d_{m-2} + \cdots + d_2 + d_1 + d_0) \bmod r \qquad (4)$$

The correctness of Eq. (4) is assured by observing that, for a given butterfly index, the equation contains the distinguishable variable at each stage.
Finally, we consider the mapping of $A_d$ into one of the address locations of the selected bank $B(A_d, r)$. To simplify the hardware implementation, the assigned address $BA(A_d, r)$ in the bank $B(A_d, r)$ is obtained by discarding the least significant digit of the original address, as given in Eq. (5). Equation (5) causes no conflict because two original addresses that differ in only the least significant digit are distributed into different banks by Eq. (4), since $0 \le d_0 \le r-1$.

$$BA(A_d, r) = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1)_r \qquad (5)$$
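The bank assignment can be sketched in a few lines of Python. This is a minimal illustration that reads the bank number as the digit sum of the address modulo r and the in-bank address as the address with its least significant digit removed; the function names are hypothetical and the code models the mapping only, not the hardware.

```python
def digits(address, radix, m):
    """Radix-r digits of address, least significant first: [d0, d1, ..., d_{m-1}]."""
    out = []
    for _ in range(m):
        out.append(address % radix)
        address //= radix
    return out

def bank(address, radix, m):
    """Bank number: sum of all address digits modulo the radix."""
    return sum(digits(address, radix, m)) % radix

def bank_address(address, radix):
    """Address inside the bank: the original address with digit d0 discarded."""
    return address // radix

if __name__ == "__main__":
    # Radix-2 example with 4-bit addresses: the bank is the parity of the bits.
    for a in (0b0000, 0b0001, 0b1011):
        print(a, bank(a, 2, 4), bank_address(a, 2))
```

Two properties make the scheme conflict-free under this reading: the pair (bank, in-bank address) is unique for every original address, and the r addresses that differ in a single digit position always fall into r different banks, so the operands of one radix-r butterfly can be accessed concurrently.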
3. FFT DESIGN AND IMPLEMENTATION
Figure 3 depicts the block diagram of the single butterfly architecture for our FFT processor. It operates on a 24.576 MHz clock and consists of a simple radix-2 DIT butterfly unit, a single-port FFT RAM, a coefficient ROM, a control unit, and an address-generate unit (AGU). All variables are complex and the internal datapath widths are either 8 or 16 bits. The details of the VLSI realization are described in the following subsections.
Figure 3. Block diagram of the FFT processor.
3.1 Memory Arrangement
For the memory arrangement, we first have to decide whether the ping-pong mode or the in-place mode is to be applied to store the intermediate values when implementing the FFT RAM. The main disadvantage of the former is that twice as much memory space is required in comparison with the in-place operation, although the control circuit is simple. For in-place scheduling, exactly one memory space is needed for storing the intermediate values, and the old computed values are immediately overwritten by the newly computed values. This is an important feature for the realization of long FFTs because the area for storing the large amount of intermediate results occupies a significant fraction of the available chip area. For this reason, we consider only in-place schemes in this work. Basically, the memory addresses of the in-place schedule can be generated with little hardware overhead based on the cyclically rotational property [11].
As is well known, the lower hardware cost of the single butterfly architecture is achieved at the price of a lower throughput rate than that of the pipelined version. According to operational mode I defined in the Eureka-147 standard, a 2048-point FFT operation should be completed within 1.25 ms. Under such a circumstance, it is not possible to complete the desired FFT operation based on the radix-2 solution without memory partition, given the chosen operational frequency of 24.576 MHz. In order to make the single butterfly architecture meet the DAB requirement, memory partitioning becomes a cost-effective solution. In our implementation, the single-port FFT RAM is divided into r = 2 banks to meet the timing requirement, and the in-place scheduling scheme is applied to save memory space.
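The timing argument can be checked with a back-of-the-envelope Python sketch. It assumes that a single-port RAM serves one access per clock cycle, so an unpartitioned radix-2 butterfly (two reads and two writes) costs four cycles, and that two banks accessed in parallel halve this to two cycles per butterfly. Both cycle costs are modelling assumptions rather than figures quoted in the text, although the two-cycle case reproduces the 22528 clock cycles listed for the proposed design in Table 1.

```python
CLOCK_HZ = 24.576e6         # clock frequency quoted in the text
SYMBOL_BUDGET_S = 1.25e-3   # time allowed for one 2048-point FFT (mode I)

def fft_time_seconds(n_points, cycles_per_butterfly):
    """Time for a radix-2 single-butterfly FFT under a fixed cycle cost."""
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    butterflies = (n_points // 2) * stages
    return butterflies * cycles_per_butterfly / CLOCK_HZ

if __name__ == "__main__":
    for label, cycles in (("no partition", 4), ("two banks", 2)):
        t = fft_time_seconds(2048, cycles)
        print(f"{label}: {t * 1e3:.3f} ms, fits budget: {t <= SYMBOL_BUDGET_S}")
```

Under these assumptions the unpartitioned design needs about 1.83 ms and misses the 1.25 ms budget, while the two-bank design needs about 0.92 ms.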
The address-generate unit shown in Figure 4 is designed to generate addresses for the two memory banks and the coefficient ROM. The butterfly counter is used to sequentially generate the required butterfly indices at stage one. The two barrel shifters first concatenate their indices, respectively, with the current butterfly index and then emulate the right rotational property of addresses at the present stage specified by the stage counter. Finally, the MUX distributes the addresses based on Eqs. (2)-(5) such that the output of each barrel shifter is directed into the correct memory bank. For the radix-2 implementation, the control signal Bank-index is derived by performing a bit-wise XOR operation on the original addresses according to Eq. (4).
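The address generation can be modelled with the following Python sketch. The exact concatenation and rotation order is not spelled out in the text, so the scheme used here is an assumption: the operand bit (0 or 1) is placed in front of the (m-1)-bit butterfly index and the resulting m-bit word is rotated right by (stage - 1) positions. All function names are illustrative.

```python
def rotate_right(word, shift, width):
    """Rotate a width-bit word right by shift positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((word >> shift) | (word << (width - shift))) & mask

def bank_index(address):
    """Radix-2 bank number: XOR of all address bits (digit sum mod 2)."""
    parity = 0
    while address:
        parity ^= address & 1
        address >>= 1
    return parity

def operand_addresses(butterfly_index, stage, m):
    """Addresses of the two butterfly operands at a 1-based stage.

    Assumed scheme: prefix the (m-1)-bit butterfly index with 0 or 1,
    then rotate the m-bit word right by (stage - 1) bits.
    """
    upper = rotate_right(butterfly_index, stage - 1, m)
    lower = rotate_right((1 << (m - 1)) | butterfly_index, stage - 1, m)
    return upper, lower

def route_to_banks(butterfly_index, stage, m):
    """Return {bank: address within bank} for one butterfly (the MUX step)."""
    return {bank_index(a): a >> 1
            for a in operand_addresses(butterfly_index, stage, m)}

if __name__ == "__main__":
    # 8-point example (m = 3): operands of butterfly 1 at stage 2.
    print(operand_addresses(1, 2, 3), route_to_banks(1, 2, 3))
```

Because the two operand addresses always differ in exactly one bit, their parities differ, so each butterfly reads one word from each bank and the two accesses never collide.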
In addition, the contents of the coefficient ROM and the corresponding addressing rules can be easily decided by following the data flow graph of the DIT FFT computation. Note that we only need to store half the twiddle coefficients due to their symmetric property. Let the radix-2 twiddle coefficient $W^p = e^{-j2\pi p/N}$ be stored in the pth ROM address. Then, the ROM contents can be accessed based on the current butterfly index BI and the present stage number t according to the following equations.
Let the binary representation of the current butterfly index be given by

$$BI = (b_{m-2}, b_{m-3}, \ldots, b_2, b_1, b_0)_2 \qquad (6)$$

where $m = \log_2 N$ is the number of stages for the radix-2 implementation. From the data flow graph, the elements $b_i$ of BI can be used as variables in conjunction with the value t to generate proper ROM addresses. Specifically, we first generate a vector from the present value t based on Eq. (7), and then the desired ROM address $p(BI, t)$ can be computed by using the vector as a mask to filter out unwanted $b_i$'s according to Eq. (8).

$$2^{t-1} - 1 = (q_{m-2}, q_{m-3}, \ldots, q_1, q_0)_2, \quad \text{for } t = 1, 2, \ldots, m \qquad (7)$$

$$p(BI, t) = (b_0 q_0,\; b_1 q_1,\; \ldots,\; b_{m-2} q_{m-2})_2 \qquad (8)$$
Equation (7) can be easily implemented by resetting a shift register and then shifting in a one from the least significant bit each time the stage advances. Equation (8) represents the masked output of the bit reversal of the current butterfly index. In both cases, the implementation cost is almost negligible.
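One way to read the ROM addressing rule is sketched below in Python: the mask for stage t is 2^(t-1) - 1, it is applied to the butterfly index, and the result is bit-reversed over m-1 bits. The operand addressing (operand bit placed in front of the butterfly index, then rotated right by t-1 bits) and the natural-order-input, bit-reversed-output DIT data flow are assumptions made for this sketch, not details confirmed by the text. Under those assumptions the complete in-place transform agrees with a direct DFT.

```python
import cmath

def rotate_right(word, shift, width):
    """Rotate a width-bit word right by shift positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((word >> shift) | (word << (width - shift))) & mask

def bit_reverse(value, width):
    """Reverse the order of the lowest width bits of value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out

def rom_address(bi, t, m):
    """ROM address for butterfly index bi at stage t (1-based)."""
    mask = (1 << (t - 1)) - 1              # shift-register mask for stage t
    return bit_reverse(bi & mask, m - 1)   # masked, bit-reversed index

def fft_in_place(x):
    """In-place radix-2 DIT FFT; the result is left in bit-reversed order."""
    n = len(x)
    m = n.bit_length() - 1
    data = list(x)
    rom = [cmath.exp(-2j * cmath.pi * p / n) for p in range(n // 2)]
    for t in range(1, m + 1):
        for bi in range(n // 2):
            # Assumed operand addressing: prefix bit, then rotate right.
            a_addr = rotate_right(bi, t - 1, m)
            b_addr = rotate_right((1 << (m - 1)) | bi, t - 1, m)
            w = rom[rom_address(bi, t, m)]
            a, b = data[a_addr], data[b_addr]
            data[a_addr], data[b_addr] = a + b * w, a - b * w
    return data
```

In this model the output sample X(n) ends up at location `bit_reverse(n, m)`, and only N/2 twiddle factors are stored, in line with the half-size ROM described above.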
Figure 4. The block diagram of the address-generate unit.
3.2 Butterfly Unit
The butterfly unit is the core of an FFT processor and determines the achievable clock speed and the resulting throughput. In this work, the butterfly unit was designed with the simple radix-2 DIT-FFT algorithm. As shown in Figure 5, the arithmetic operations consist of calculating a pair of complex values, A' = A + BW and B' = A - BW, from a pair of complex inputs, A and B, and the twiddle coefficient W.
Figure 5. The arithmetic of the radix-2 DIT-FFT algorithm.
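The butterfly arithmetic can be written out in terms of real operations, as in the minimal Python sketch below. It uses the straightforward four-multiplication form of the complex product; whether the actual datapath uses this form is not stated in the text, and the function names are illustrative.

```python
def complex_multiply(b_re, b_im, w_re, w_im):
    """(b_re + j*b_im) * (w_re + j*w_im) using four real multiplications."""
    return (b_re * w_re - b_im * w_im, b_re * w_im + b_im * w_re)

def butterfly(a, b, w):
    """Radix-2 DIT butterfly on (re, im) tuples: A' = A + B*W, B' = A - B*W."""
    p_re, p_im = complex_multiply(b[0], b[1], w[0], w[1])
    return (a[0] + p_re, a[1] + p_im), (a[0] - p_re, a[1] - p_im)

if __name__ == "__main__":
    # A = 1 + 2j, B = 3 + 4j, W = -j, so B*W = 4 - 3j.
    print(butterfly((1, 2), (3, 4), (0, -1)))
```

One complex multiplication followed by one complex addition and one complex subtraction is all that a radix-2 butterfly requires, which is what keeps the arithmetic part of the single butterfly design small.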
For a butterfly unit without pipelining, the critical path is the sum of the delays of the memory read operation, the arithmetic operation (multiplication and addition of complex numbers), and the memory write operation. To reduce the critical path delay, we divide the whole operation of the butterfly unit into (s+2) steps (the first step for the memory read operation, the following s steps for the arithmetic operation, and the last step for the memory write operation), as indicated in Figure 6. Due to the in-place computation, we have to schedule the tasks assigned to the pipelined butterfly unit such that no control hazard occurs during memory accesses. A control hazard (see Figure 7(a)) results from the conflict when the butterfly unit intends to access more than two data in the same memory bank. Figure 7(b) shows the schedule that eliminates the control hazard, provided that only single-port memory is available in the implementation. The arrangement of Figure 7(b) results in only 50% hardware utilization of the pipelined butterfly unit.
In contrast, 100% hardware utilization can be achieved if dual-port memory is employed in the design. Note that the area occupied by the memory module is proportional not only to the number of stored data but also to the number of ports. Obviously, the chip area of a dual-port memory is much larger than that of a single-port memory.
Since we use a 24.576 MHz clock in our FFT processor, the arithmetic operation can be finished within one clock cycle (s = 1). Each butterfly operation thus takes only three clock cycles, one each for the memory read operation, the arithmetic operation, and the memory write operation. In addition, only 50% hardware utilization is achieved because single-port memory is employed in our design to reduce the hardware cost.
Figure 6. Radix-2 DIT pipelined butterfly unit.
Figure 7. (a) The control hazard. (b) The schedule that resolves the control hazard.
4. CHIP REALIZATION AND COMPARISON
All the modules in our design have been successfully implemented based on the TSMC (Taiwan Semiconductor Manufacturing Company) 0.35 μm 1P4M CMOS process and simulated using Synopsys and Cadence tools. Based on the specifications of the DAB channel demodulator, the resulting FFT processor is capable of completing the four operational modes (mode I: 2048 points, mode II: 512 points, mode III: 256 points, and mode IV: 1024 points) with a clock frequency of 24.576 MHz. The corresponding physical layout is shown in Figure 8; it includes 2×1024×16 RAMs (two banks, each containing 1024×16 bits) and 2×1024×8 ROMs (one for the real part and another for the imaginary part). In terms of 2-input NAND gates, the total gate count is 4351, excluding the memories. The resulting core size of the chip implementation is about 2086 × 1806 μm² and the overall chip size including I/O pads is 2856 × 2594 μm².
Figure 8. The layout of the developed FFT processor.
We compare the performance of our implementation with the following FFT implementations: the pipelined architecture [7] and the single butterfly architecture [5]. The circuit complexities of these designs are compiled in Table 1. The pipelined architecture in [7] might be the preferred choice for high-speed applications, but it is not suitable for the DAB system. The memory bandwidth problem of [5] is solved by introducing a more complicated structure (the radix-4 butterfly unit) and utilizing more memory resources. Note that the operation frequency of [5] is 12.288 MHz. By taking advantage of efficient memory partitioning and employing the pipelined butterfly unit, our design reduces the required area complexity while still meeting the DAB specifications. For DAB applications, it is clear that our design outperforms Delaruelle's work.
5. CONCLUSION
To date, much effort has been devoted to the development of low-cost DAB products. Among the key techniques for building a DAB receiver, the FFT is one of the key components and is very suitable for ASIC implementation. This paper explores efficient solutions for hardware implementations of the FFT processor such that they fit the specification of the Eureka-147 standard under limited hardware resources. All the functional blocks are designed, simulated, and verified using the Synopsys and Cadence software, and the final layout is ready for VLSI fabrication based on the 0.35 μm TSMC process and Compass cell library.
Results show that our implementation has the potential to consume less silicon area and to facilitate extension to high transmission rate requirements.
REFERENCES
[1] ETS 300 401, "Radio broadcasting system: Digital audio broadcasting (DAB) to mobile, portable and fixed receivers," ETSI, 2nd edition, May 1997.
[2] J. A. Huisken, F. V. Laar, A. Delaruelle, and N. J. L. Philips, "Specification, partitioning and design of a DAB channel decoder," in Proc. VLSI Signal Processing Workshop, pp. 21-29, 1993.
[3] M. Bolle, D. Clawin, K. Gieske, F. Hofmann, T. Mlasko, M. J. Ruf, and G. Spreitz, "The receiver engine chipset for digital audio broadcasting," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, pp. 338-34…, 1998.
[4] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, "A chip set for a digital audio broadcasting channel decoder," in Proc. IEEE Custom Integrated Circuits Conf., pp. 13.4.1-13.4.4, 1995.
[5] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, "A channel demodulator IC for digital audio broadcasting," in Proc. IEEE Custom Integrated Circuits Conf., pp. 47-50, 1994.
[6] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf., pp. 131-134, 1998.
[7] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 300-305, March 1995.
[8] E. Cetin, R. C. S. Morling, and I. Kale, "An extensible complex fast Fourier transform processor chip for real-time spectrum analysis and measurement," IEEE Trans. Instrumentation and Measurement, vol. 47, no. 1, pp. 95-99, Feb. 1998.
[9] H. F. Lo, M. D. Shieh, and C. M. Wu, "Design of an efficient FFT processor for DAB system," in Proc. IEEE Int. Symp. Circuits and Systems, pp. 654-657, 2001.
[10] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice-Hall Inc., 1988.
[11] M. Biver, H. Kaeslin, and C. Tommasini, "In-place updating of path metrics in Viterbi decoders," IEEE J. Solid-State Circuits, vol. 24, pp. 1158-1159, Aug. 1989.
Table 1. Comparisons of different implementations

                                      E. Bidet [7]                                      A. Delaruelle [5]                    Proposed
No. of butterfly units                log_r N, radix-r                                  1, radix-4                           1, radix-2
Arithmetic components                 3×(log_r N − 1) CM, 4×log_r N Adder,              … CM, 4 Adder, 4 Sub, 4 Register     1 CM, 1 Adder, … Sub
                                      4×log_r N Sub
Gate counts of arithmetic components  8160×(log_r N − 1) + 896×log_r N                  9156                                 2954
Memory size                           4×A…                                              2048 (dual-port)                     2×2048
No. of clock cycles (N = 2048)        2458                                              11264                                22528

Note: (1) CM: 8-bit complex-number multiplier, (2) Adder: 16-bit adder, (3) Sub: 16-bit subtractor, (4) A_1 = … log …, and (5) A_2 = …