
IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM

Chien-Ming Wu, Ming-Der Shieh, Hsin-Fu Lo, and Min-Hsiung Hu

Graduate School of Engineering Science & Technology, National Yunlin University of Science & Technology, Taiwan

Department of Electronic Engineering, National Yunlin University of Science & Technology, Taiwan

Division of Design Service, National Science Council Chip Implementation Center (CIC), Taiwan

ABSTRACT

This paper describes the VLSI implementation of the Fast Fourier Transform (FFT) for the Eureka-147 Digital Audio Broadcasting (DAB) system. We emphasize how to minimize the hardware requirement and efficiently manage the memory to meet the DAB requirement. Implementation results demonstrate the applicability of our work, which features a modular design, consumes less silicon area, and facilitates extension to high transmission rate applications. The core size of the resulting chip implementation is 2086 × 1806 μm² based on the TSMC 0.35 μm 1P4M CMOS process. Performance evaluation reveals that our design for the targeted channel demodulator outperforms previous solutions.

1. INTRODUCTION

The Digital Audio Broadcasting (DAB) system, described in the European Eureka-147 standard [1], offers high-quality audio services, supports multimedia data delivery to mobile receivers, and might replace the traditional radio system. Basically, two strategies are employed to implement the DAB receiver: the DSP-based architecture [2, 3] and the ASIC-based implementation [4, 5]. The former offers maximum flexibility, ease of use, and simple programming, but it can only provide limited processing capability. In contrast, the ASIC-based implementation has the potential to support real-time symbol decoding and low-cost implementation.

Figure 1 shows an overview of the DAB system, in which ISO/MPEG coding is adopted for source coding and COFDM (Coded Orthogonal Frequency Division Multiplexing) for channel coding and modulation [1]. After convolutional coding, the generated codewords are interleaved in frequency for the fast information channel and in both time and frequency for the main service channel, and then the OFDM modulation is performed.

In this paper, we focus on the design and implementation of the channel demodulator, which essentially performs a Fast Fourier Transform (FFT). In general, two basic types of FFT architectures can be found in the literature: the pipelined architecture, with each stage consisting of a butterfly unit [6, 7], and the single butterfly architecture [5, 8], which employs just one radix-r butterfly unit. The main concern is the trade-off between hardware overhead and speed requirement.

Although the pipelined architecture can provide a higher throughput rate than the single butterfly implementation, we are still interested in the single butterfly architecture because of the specifications of the channel demodulator as well as the hardware considerations in the implementation of DAB receivers. For the single butterfly implementation, a basic problem is how to efficiently manage memory read/write accesses in order to increase the throughput rate. The common solutions include: (1) Use a high-radix implementation to reduce the total number of memory accesses at the expense of increased arithmetic complexity, i.e., the hardware requirement of a high-radix butterfly unit. (2) Partition the memory into several banks to allow concurrent accesses of multiple data, at the cost of a more complicated addressing scheme, which might correspond to a larger routing area.

In this paper, we describe the design and implementation of the FFT for the DAB channel demodulator. We show our experience in using the conflict-free memory addressing arrangement in [9] to minimize the hardware requirement and to match the DAB requirement. Implementation results demonstrate the applicability of our work to the targeted channel demodulator and its advantages over previous solutions [5, 7] in terms of hardware requirement. The rest of this paper is organized as follows: Section 2 reviews the background and our previous work [9] related to this paper. Section 3 describes the resulting architecture and design of the FFT processor. Then, the corresponding chip implementation and performance evaluation are shown in Section 4. Finally, Section 5 concludes this work.

Figure 1. An overview of the DAB system [5].

2. PRELIMINARY RESULTS

The N-point Discrete Fourier Transform (DFT) of a sequence x(k) is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk} \qquad (1)$$

where $n = 0, 1, \ldots, N-1$ and $W = e^{-j2\pi/N}$. From Eq. (1), we know that $N^2$ multiplications and $N(N-1)$ additions are needed to directly perform the required computations. By applying the FFT, the computational complexity can be reduced to $O(N \log N)$.
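The operation counts quoted for the direct evaluation of Eq. (1) can be checked with a short sketch. The function name `direct_dft` and its counters are illustrative and not part of the original design; the code simply evaluates the sum term by term and counts complex multiplications and additions.

```python
import cmath

def direct_dft(x):
    """Evaluate X(n) = sum_k x(k) * W**(n*k) directly, counting operations."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)   # W = e^{-j*2*pi/N}
    mults = 0
    adds = 0
    spectrum = []
    for n in range(n_points):
        acc = x[0] * w ** 0                    # first product (k = 0)
        mults += 1
        for k in range(1, n_points):
            acc += x[k] * w ** (n * k)         # one multiply and one add per term
            mults += 1
            adds += 1
        spectrum.append(acc)
    return spectrum, mults, adds

spectrum, mults, adds = direct_dft([1, 0, 0, 0, 0, 0, 0, 0])
print(mults, adds)   # 64 56, i.e. N^2 and N(N-1) for N = 8
```

For N = 2048 the same counting gives about 4.2 million complex multiplications, which is why the direct form is not considered further.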

If the number of sampled points is a power of the radix r, then it is easy to compute the DFT by using a radix-r FFT algorithm. In such a case, the N-point DFT can be decomposed into a set of recursively related r-point transforms. Decimation in time (DIT) and decimation in frequency (DIF) are the two basic classes of FFT algorithms [10]. Specifically, the DIT FFT algorithm is based on decomposing the input sequence x(k) into successively smaller subsequences. The DIF FFT algorithm decomposes the output sequence X(n) into smaller subsequences in the same way. Figure 2 shows a DIT 8-point FFT algorithm, in which the data in each stage can be processed by the so-called butterfly units.

0-7803-7761-3/03/$17.00 ©2003 IEEE



Figure 2. The data flow graph of the DIT 8-point FFT computation.

In general, an N-point FFT computation requires $(N/r) \times \log_r N$ radix-r butterfly computations, and either the pipelined architecture or the single butterfly architecture can be selected for a dedicated application. For the single butterfly implementation, this implies $2N \times \log_r N$ memory accesses, which are the main bottleneck for fast FFT computation. Therefore, we need an efficient memory management strategy to overcome this problem, i.e., to reduce the number of memory accesses or to increase the memory bandwidth.

In our previous work [9], we presented a set of simple but efficient equations to partition the memory into a number of memory banks such that the equivalent memory bandwidth can be increased with simple interconnection networks.

Let m be the number of stages for the FFT computation; its value can be computed by

$$m = \lceil \log_r N \rceil \qquad (2)$$

Following the notation of the conventional number system, it is assumed that the original memory address d is expressed in the unsigned radix-r representation defined as

$$d = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1, d_0)_r \qquad (3)$$

where $d_i$ is an integer and $0 \le d_i \le r-1$. Consequently, a feasible solution to partition the memory into r banks can be easily obtained as shown in Eq. (4), which implies that the original address d will be distributed into the bank number B(d, r).

$$B(d, r) = (d_{m-1} + d_{m-2} + \cdots + d_2 + d_1 + d_0) \bmod r \qquad (4)$$

The correctness of Eq. (4) is assured by observing that, for a given butterfly index, the equation contains the distinguishable variable at each stage.

Finally, we consider the mapping of d into one of the address locations of the selected bank B(d, r). To simplify the hardware implementation, the assigned address BA(d, r) in the bank B(d, r) is obtained by discarding the least significant digit of the original address, as given in Eq. (5). Equation (5) causes no conflict because two original addresses that differ only in the least significant digit are distributed into different banks by Eq. (4), since $0 \le d_0 \le r-1$.

$$BA(d, r) = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1)_r \qquad (5)$$
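The bank assignment of Eqs. (3)-(5) can be illustrated with a minimal sketch. It assumes the digit-sum form of Eq. (4) and the drop-the-last-digit form of Eq. (5); the helper names (`radix_digits`, `bank`, `bank_address`, `butterfly_banks`, `is_conflict_free`) are hypothetical and chosen only for this example.

```python
def radix_digits(address, r, m):
    """Return [d_0, d_1, ..., d_{m-1}], least significant digit first (Eq. (3))."""
    digits = []
    for _ in range(m):
        digits.append(address % r)
        address //= r
    return digits

def bank(address, r, m):
    """Bank number B(d, r): sum of the radix-r digits modulo r (Eq. (4))."""
    return sum(radix_digits(address, r, m)) % r

def bank_address(address, r):
    """Address inside the bank: drop the least significant digit (Eq. (5))."""
    return address // r

def butterfly_banks(base, position, r, m):
    """Banks hit by the r operands that differ only in digit `position`."""
    digits = radix_digits(base, r, m)
    banks = []
    for value in range(r):
        digits[position] = value
        banks.append(sum(digits) % r)
    return banks

def is_conflict_free(r, m):
    """Check that no two addresses share the same (bank, bank address) slot."""
    n = r ** m
    slots = {(bank(d, r, m), bank_address(d, r)) for d in range(n)}
    return len(slots) == n

print(butterfly_banks(27, 1, 4, 3))   # [0, 1, 2, 3]
print(is_conflict_free(4, 3))         # True
```

In this example the four operands of a radix-4 butterfly land in four different banks, so they could be fetched concurrently, and every original address keeps a unique slot after partitioning.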

3. FFT DESIGN AND IMPLEMENTATION

Figure 3 depicts the block diagram of the single butterfly architecture for our FFT processor. It operates on a 24.576 MHz clock and consists of a simple radix-2 DIT butterfly unit, a single-port FFT RAM, a coefficient ROM, a control unit, and an address-generate unit (AGU). All variables are complex and the internal datapath widths are either 8 or 16 bits. The details of the VLSI realization are described in the following subsections.

Figure 3. Block diagram of the FFT processor.

3.1 Memory Arrangement

For the memory arrangement, we first have to decide whether the ping-pong mode or the in-place mode is to be applied to store the intermediate values when implementing the FFT RAM. The main disadvantage of the former is that twice as much memory space is required in comparison with the in-place operation, although the control circuit is simple. For in-place scheduling, exactly one memory space is needed for storing the intermediate values, and the old values are immediately overwritten by the newly computed values. This is an important feature for the realization of long FFTs because the area for storing the large amount of intermediate results occupies a significant fraction of the available chip area. For this reason, we consider only in-place schemes in this work. Basically, the memory addresses of the in-place schedule can be generated with little hardware overhead based on the cyclic rotation property [11].

The lower hardware cost of the single butterfly architecture is achieved at the price of a lower throughput rate than the pipelined version. According to operational mode I defined in the Eureka-147 standard, a 2048-point FFT operation should be completed within 1.25 ms. Under such a circumstance, it is not possible to complete the desired FFT operation with the radix-2 solution without memory partitioning given the chosen operating frequency of 24.576 MHz. To make the single butterfly architecture meet the DAB requirement, memory partitioning becomes a cost-effective solution. In our implementation, the single-port FFT RAM is divided into r = 2 banks to meet the timing requirement, and the in-place scheduling scheme is applied to save memory space.
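The timing argument can be made concrete with a small calculation. This is a sketch under stated assumptions: each single-port bank serves one access per clock cycle, the access count is the $2N \log_2 N$ figure from Section 2, no other overhead is modeled, and the 1.25 ms budget is the mode I figure quoted in the text. The names `CLOCK_HZ`, `DEADLINE_S`, and `fft_time` are illustrative.

```python
CLOCK_HZ = 24.576e6      # operating frequency from the text
DEADLINE_S = 1.25e-3     # mode I budget for one 2048-point FFT, from the text

def fft_time(n_points, accesses_per_cycle):
    """Cycles and seconds for a radix-2 FFT limited only by memory accesses."""
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    accesses = 2 * n_points * stages            # 2N * log2(N) reads and writes
    cycles = accesses // accesses_per_cycle
    return cycles, cycles / CLOCK_HZ

single_cycles, single_time = fft_time(2048, 1)  # one unpartitioned single-port RAM
banked_cycles, banked_time = fft_time(2048, 2)  # two single-port banks

print(single_cycles, single_time > DEADLINE_S)  # 45056 True
print(banked_cycles, banked_time < DEADLINE_S)  # 22528 True
```

Under these assumptions the unpartitioned memory needs about 1.83 ms, while two banks need about 0.92 ms, which is consistent with the 22528 clock cycles listed for the proposed design in Table 1.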

The address-generate unit shown in Figure 4 is designed to generate addresses for the two memory banks and the coefficient ROM. The butterfly counter is used to sequentially generate the required butterfly indices at stage one. The two barrel shifters first concatenate their indices, respectively, with the current butterfly index and then emulate the right rotational property of addresses at the present stage specified by the stage counter. Finally, the MUX distributes the addresses based on Eqs. (2)-(5) such that the output of each barrel shifter is directed to the correct memory bank. For the radix-2 implementation, the control signal Bank-index is derived by performing a bit-wise XOR operation on the original addresses according to Eq. (4).
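A behavioral sketch of the address-generate unit for the radix-2 case is given below. Several details are assumptions made for illustration rather than facts from the paper: the shifter index (0 or 1) is appended as the least significant bit of the butterfly index, the concatenated word is rotated right by the stage number t (counted from 1), and the function names are hypothetical. The bank index is the XOR of all address bits, which equals the digit sum modulo 2 in Eq. (4).

```python
def rotate_right(value, shift, width):
    """Rotate a `width`-bit word right by `shift` positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value >> shift) | (value << (width - shift))) & mask

def operand_addresses(bi, t, m):
    """Two operand addresses of butterfly `bi` at stage `t` (assumed layout)."""
    a = rotate_right((bi << 1) | 0, t, m)   # barrel shifter with index 0
    b = rotate_right((bi << 1) | 1, t, m)   # barrel shifter with index 1
    return a, b

def bank_index(address):
    """XOR of all address bits, i.e. Eq. (4) for r = 2."""
    parity = 0
    while address:
        parity ^= address & 1
        address >>= 1
    return parity

def bank_address(address):
    """Eq. (5) for r = 2: drop the least significant bit."""
    return address >> 1

def agu(bi, t, m):
    """Route each operand to its bank: {bank number: address inside the bank}."""
    a, b = operand_addresses(bi, t, m)
    return {bank_index(a): bank_address(a), bank_index(b): bank_address(b)}

print(operand_addresses(5, 1, 4))   # (5, 13)
print(agu(5, 1, 4))                 # {0: 2, 1: 6}
```

Because a rotation only permutes bit positions, the two operands always differ in exactly one bit, so their parities differ and they never compete for the same single-port bank.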

In addition, the contents of the coefficient ROM and the corresponding addressing rules can be easily decided by following the data flow graph of the DIT FFT computation. Note that we only need to store half of the twiddle coefficients due to their symmetric property. Let the radix-2 twiddle coefficient $W^p = e^{-j2\pi p/N}$ be stored in the p-th ROM address. Then, the ROM contents can be accessed based on the current butterfly index BI and the present stage number t according to the following equations.

Let the binary representation of the current butterfly index be given by

$$BI = (b_{m-2}, b_{m-3}, \ldots, b_2, b_1, b_0)_2 \qquad (6)$$

where $m = \log_2 N$ is the number of stages for the radix-2 implementation. From the data flow graph, the elements $b_i$ of BI can be used as variables in conjunction with the value t to generate proper ROM addresses. Specifically, we first generate a vector from the present value t based on Eq. (7), and then the desired ROM address p(BI, t) can be computed by using the vector as a mask to filter out the unwanted $b_i$ according to Eq. (8).

$$2^{t-1} - 1 = (q_{m-2}, q_{m-3}, \ldots, q_1, q_0)_2, \quad \text{for } t = 1, 2, \ldots, m \qquad (7)$$

Equation (7) can be easily implemented by resetting a shift register and then shifting in a one from the least significant bit each time the stage advances. Eq. (8) represents the masked output of the bit reversal of the current butterfly index. In both cases, the implementation cost is almost negligible.
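The ROM addressing can be exercised end to end with a small behavioral model. This is a sketch under assumptions, since Eq. (8) is not legible in the available text: the mask of Eq. (7) is applied to the butterfly index before an (m-1)-bit bit reversal, the operand addresses are formed by appending the shifter index as the least significant bit and rotating right by the stage number, and the input is in natural order with the output in bit-reversed order. All function names are hypothetical. Comparing the result with a direct DFT checks that these readings are at least mutually consistent.

```python
import cmath

def rotate_right(value, shift, width):
    """Rotate a `width`-bit word right by `shift` positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value >> shift) | (value << (width - shift))) & mask

def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def rom_address(bi, t, m):
    """Masked bit reversal of the butterfly index (assumed form of Eq. (8))."""
    mask = (1 << (t - 1)) - 1            # Eq. (7): t-1 ones shifted in from the LSB
    return bit_reverse(bi & mask, m - 1)

def fft_in_place(x):
    """Radix-2 DIT FFT with one butterfly, in-place data, and a half-size ROM."""
    n = len(x)
    m = n.bit_length() - 1
    assert 1 << m == n, "length must be a power of two"
    rom = [cmath.exp(-2j * cmath.pi * p / n) for p in range(n // 2)]
    data = list(x)
    for t in range(1, m + 1):            # stage counter
        for bi in range(n // 2):         # butterfly counter
            a_addr = rotate_right(bi << 1, t, m)
            b_addr = rotate_right((bi << 1) | 1, t, m)
            w = rom[rom_address(bi, t, m)]
            a, b = data[a_addr], data[b_addr]
            data[a_addr] = a + b * w     # A' = A + BW
            data[b_addr] = a - b * w     # B' = A - BW
    # Results sit in bit-reversed order; reorder them for comparison.
    return [data[bit_reverse(k, m)] for k in range(n)]

samples = [complex(i % 5, -(i % 3)) for i in range(16)]
reference = [sum(samples[k] * cmath.exp(-2j * cmath.pi * n * k / 16)
                 for k in range(16)) for n in range(16)]
result = fft_in_place(samples)
print(max(abs(u - v) for u, v in zip(result, reference)) < 1e-9)   # True
```

The ROM holds only N/2 coefficients, matching the remark that half of the twiddle factors suffice.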

Figure 4. The block diagram of the address-generate unit.

3.2 Butterfly Unit

The butterfly unit is the core of an FFT processor and determines the achievable clock speed and the resulting throughput. In this work, the butterfly unit was designed with the simple radix-2 DIT-FFT algorithm. As shown in Figure 5, the arithmetic operations consist of calculating a pair of complex values, A' = A + BW and B' = A - BW, from a pair of complex inputs, A and B, and the twiddle coefficient W.

Figure 5. The arithmetic of the radix-2 DIT-FFT algorithm.
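The butterfly arithmetic can be written out in terms of real operations. The sketch below is only illustrative: it represents each complex value as a (real, imaginary) pair, uses the textbook four-multiplication form of the complex product, and ignores the fixed-point word lengths of the actual datapath. The names `complex_multiply` and `butterfly` are hypothetical.

```python
def complex_multiply(b, w):
    """Product B*W using four real multiplications and two real add/subtracts."""
    b_re, b_im = b
    w_re, w_im = w
    return (b_re * w_re - b_im * w_im, b_re * w_im + b_im * w_re)

def butterfly(a, b, w):
    """Radix-2 DIT butterfly: returns (A + BW, A - BW)."""
    p_re, p_im = complex_multiply(b, w)
    a_re, a_im = a
    return (a_re + p_re, a_im + p_im), (a_re - p_re, a_im - p_im)

# A = 1 + 2j, B = 3 + 4j, W = -j
print(butterfly((1, 2), (3, 4), (0, -1)))   # ((5, -1), (-3, 5))
```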

For a butterfly unit without pipelining, the critical path is the sum of the memory read operation, the arithmetic operation (multiplication and addition of complex numbers), and the memory write operation. To reduce the critical path delay, we divide the whole operation of the butterfly unit into (s+2) steps (the first step for the memory read operation, the following s steps for the arithmetic operation, and the last step for the memory write operation), as indicated in Figure 6. Because of the in-place computation, we have to schedule the tasks assigned to the pipelined butterfly unit such that no control hazard occurs during memory accesses. A control hazard (see Figure 7(a)) results from the conflict when the butterfly unit intends to access more than two data in the same memory bank. Figure 7(b) shows the schedule that eliminates the control hazard provided that only single-port memory is available in the implementation. The arrangement of Figure 7(b) results in only 50% hardware utilization of the pipelined butterfly unit. In contrast, 100% hardware utilization can be achieved if dual-port memory is employed in the design. Note that the area occupied by the memory module is proportional not only to the number of stored data but also to the number of ports. Obviously, the chip area of a dual-port memory is much higher than that of a single-port memory.

Since we use a 24.576 MHz clock in our FFT processor, the arithmetic operation can be finished within one clock cycle (s = 1). Each butterfly operation thus takes only three clock cycles, one each for the memory read operation, the arithmetic operation, and the memory write operation. In addition, only 50% hardware utilization is achieved because single-port memory is employed in our design to reduce the hardware cost.
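The 50% utilization figure can be reproduced with a simple scheduling model. This is a sketch under assumptions and not the schedule of Figure 7(b) itself: each butterfly reads both operands in one cycle (one per bank), computes for s cycles, and writes both results in one cycle; with single-port banks a read and a write cannot share a cycle, and butterflies are issued greedily at the earliest conflict-free cycle. The function name `schedule` and its parameters are hypothetical.

```python
def schedule(num_butterflies, compute_steps=1, dual_port=False):
    """Greedy issue of butterflies; returns (total cycles, arithmetic utilization)."""
    read_busy = set()
    # With single-port banks, reads and writes compete for the same port.
    write_busy = set() if dual_port else read_busy
    write_offset = compute_steps + 1     # read, s compute steps, then write
    cycle = 0
    last_write = -1
    for _ in range(num_butterflies):
        # Find the earliest cycle whose read slot and write slot are both free.
        while cycle in read_busy or (cycle + write_offset) in write_busy:
            cycle += 1
        read_busy.add(cycle)
        write_busy.add(cycle + write_offset)
        last_write = cycle + write_offset
        cycle += 1
    total_cycles = last_write + 1
    return total_cycles, num_butterflies / total_cycles

butterflies = (2048 // 2) * 11           # (N/2) * log2(N) for N = 2048
print(schedule(butterflies))             # (22528, 0.5)
print(schedule(butterflies, dual_port=True)[0])   # 11266
```

Under this model the single-port design needs 22528 cycles for a 2048-point FFT, in agreement with Table 1, whereas a dual-port memory would let one butterfly start every cycle at the cost of a larger memory area.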

Figure 6. Radix-2 DIT pipelined butterfly unit.

Figure 7. (a) The control hazard. (b) The reconciliation of the control hazard.


4. CHIP REALIZATION AND COMPARISON

All the modules in our design have been successfully implemented based on the TSMC (Taiwan Semiconductor Manufacturing Company) 0.35 μm 1P4M CMOS process and simulated using Synopsys and Cadence tools. Based on the specifications of the DAB channel demodulator, the resulting FFT processor is capable of completing the four operational modes (mode I: 2048 points, mode II: 512 points, mode III: 256 points, and mode IV: 1024 points) with a clock frequency of 24.576 MHz. The corresponding physical layout is shown in Figure 8; it includes 2 × 1024 × 16 RAMs (two banks, each containing 1024 × 16 bits) and 2 × 1024 × 8 ROMs (one for the real part and another for the imaginary part). In terms of 2-input NAND gates, the total gate count is 4351, excluding the memories. The resulting core size of the chip implementation is about 2086 × 1806 μm² and the overall chip size including I/O pads is 2856 × 2594 μm².

Figure 8. The layout of the developed FFT processor.

We compare the performance of our implementation with the following FFT implementations: the pipelined architecture [7] and the single butterfly architecture [5]. The circuit complexities of these designs are compiled in Table 1. The pipelined architecture in [7] might be the preferred choice for high-speed applications, but it is not suitable for the DAB system. The memory bandwidth problem of [5] is solved by introducing a more complicated structure (the radix-4 butterfly unit) and utilizing more memory resources. Note that the operating frequency of [5] is 12.288 MHz. By taking advantage of efficient memory partitioning and employing the pipelined butterfly unit, our design reduces the required area complexity and still meets the DAB specifications. For DAB applications, our design therefore outperforms Delaruelle's work in terms of hardware requirement.

5. CONCLUSION

To date, much effort has been devoted to the development of low-cost DAB products. Among the key techniques needed to build a DAB receiver, the FFT is one of the key components and is very suitable for ASIC implementation. This paper explores efficient solutions for hardware implementations of the FFT processor such that they fit the specification of the Eureka-147 standard under limited hardware resources. All the functional blocks are designed, simulated, and verified using the Synopsys and Cadence software, and the final layout is ready for VLSI fabrication based on the 0.35 μm TSMC process and the Compass cell library. Results show that our implementation has the potential to consume less silicon area and to facilitate extension to high transmission rate requirements.

REFERENCES

[1] ETS 300 401, "Radio broadcasting system: Digital audio broadcasting (DAB) to mobile, portable and fixed receivers," ETSI, 2nd edition, May 1997.

[2] J. A. Huisken, F. V. Laar, A. Delaruelle, and N. L. Philips, "Specification, partitioning and design of a DAB channel decoder," in Proc. VLSI Signal Processing Workshop, pp. 21-29, 1993.

[3] M. Bolle, D. Clawin, K. Gieske, F. Hofmann, T. Mlasko, M. J. Ruf, and G. Spreitz, "The receiver engine chipset for digital audio broadcasting," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, pp. 338-34, 1998.

[4] A. Delaruelle, J. Huisken, J. V. Loon, and F. Welten, "A chip set for digital audio broadcasting channel decoder," in Proc. IEEE Custom Integrated Circuits Conf., pp. 13.4.1-13.4.4, 1995.

[5] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, "A channel demodulator IC for digital audio broadcasting," in Proc. IEEE Custom Integrated Circuits Conf., pp. 47-50, 1994.

[6] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf., pp. 131-134, 1998.

[7] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 300-305, March 1995.

[8] E. Cetin, Richard C. S. Morling, and I. Kale, "An extensible complex fast Fourier transform processor chip for real-time spectrum analysis and measurement," IEEE Trans. Instrumentation and Measurement, vol. 47, no. 1, pp. 95-99, Feb. 1998.

[9] H. F. Lo, M. D. Shieh, and C. M. Wu, "Design of an efficient FFT processor for DAB system," in Proc. IEEE Int. Symp. Circuits and Systems, pp. 654-657, 2001.

[10] E. O. Brigham, The Fast Fourier Transform and its Applications, Prentice-Hall Inc., 1988.

[11] M. Biver, H. Kaeslin, and C. Tommasini, "In-place updating of path metrics in Viterbi decoders," IEEE J. Solid-State Circuits, vol. 24, pp. 1158-1159, Aug. 1989.

Table 1. Comparisons of different implementations

                                      | Bidet [7]                                          | A. Delaruelle [5]                | Proposed
No. of butterfly units                | log_4 N, radix-4                                   | 1, radix-4                       | 1, radix-2
Arithmetic components                 | 3(log_4 N - 1) CM, 4 log_4 N Adder, 4 log_4 N Sub  | 3 CM, 4 Adder, 4 Sub, 4 Register | 1 CM, 1 Adder, 1 Sub
Gate counts of arithmetic components  | 8160(log_4 N - 1) + 896 log_4 N                    | 9156                             | 2954
Memory size                           | 4 × A_1                                            | 2 × 2048 (dual-port)             | 2048
No. of clock cycles (N = 2048)        | 2458                                               | 11264                            | 22528

Note: (1) CM: complex-number multiplier, (2) Adder: 16-bit adder, (3) Sub: 16-bit subtractor, (4) A_1 = … log_4 N, and (5) A_2 = …