ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Discrete Cosine Transform

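• Frequency transform
• Used for pattern recognition, image processing, and still/moving image and video processing
• For an N-point sequence x(n), the N-point DCT and IDCT pair is defined as

$$X(k) = e(k)\sum_{n=0}^{N-1} x(n)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1$$

$$x(n) = \frac{2}{N}\sum_{k=0}^{N-1} e(k)X(k)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad n = 0, 1, \ldots, N-1$$

where

$$e(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise} \end{cases}$$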
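The DCT/IDCT pair can be checked numerically with a direct evaluation of the two sums. The following is a minimal sketch, not an efficient implementation; the function names `e`, `dct` and `idct` and the test vector are chosen here for illustration only.

```python
import math

def e(k):
    """Scale factor e(k): 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct(x):
    """Direct N-point DCT: X(k) = e(k) * sum_n x(n) cos((2n+1) k pi / 2N)."""
    N = len(x)
    return [e(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]

def idct(X):
    """Direct N-point IDCT: x(n) = (2/N) * sum_k e(k) X(k) cos((2n+1) k pi / 2N)."""
    N = len(X)
    return [(2.0 / N) * sum(e(k) * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for k in range(N))
            for n in range(N)]

# Round trip on an arbitrary 8-point sequence (illustrative data).
x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
x_rec = idct(dct(x))
```

Each output costs N multiply-adds, so the direct form needs N^2 of them in total; this is the baseline that the fast algorithms later in the lecture improve on.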

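N-point DCT/IDCT

• The N-point DCT and IDCT pair can be derived using a 2N-point discrete Fourier transform (DFT) pair, using x(n) and its mirror image (with x(n) taken as zero outside 0 ≤ n ≤ N-1):

$$y(n) = x(n) + x(2N-n-1) = \begin{cases} x(n), & 0 \le n \le N-1 \\ x(2N-n-1), & N \le n \le 2N-1 \end{cases}$$

y(n) is symmetric with respect to the midpoint at n = N - 1/2. The 2N-point DFT of y(n), for 0 ≤ k ≤ 2N-1, is given by

$$Y_D(k) = \sum_{n=0}^{2N-1} y(n)\,e^{-j\frac{2\pi kn}{2N}} = \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi kn}{2N}} + \sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi kn}{2N}}$$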

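Substituting n = 2N-n'-1 into the second summation, we have

$$\sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi kn}{2N}} = \sum_{n'=N-1}^{0} x(n')\,e^{-j\frac{2\pi k(2N-n'-1)}{2N}} = \sum_{n'=0}^{N-1} x(n')\,e^{j\frac{2\pi kn'}{2N}}\,e^{j\frac{2\pi k}{2N}}$$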

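N-point DCT (cont.)

• Now we have

$$\begin{aligned} Y_D(k) &= \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi kn}{2N}} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{2\pi kn}{2N}}\,e^{j\frac{2\pi k}{2N}} \\ &= e^{j\frac{\pi k}{2N}}\left(\sum_{n=0}^{N-1} x(n)\,e^{-j\frac{(2n+1)k\pi}{2N}} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{(2n+1)k\pi}{2N}}\right) \\ &= e^{j\frac{\pi k}{2N}}\sum_{n=0}^{N-1} 2x(n)\cos\left(\frac{(2n+1)k\pi}{2N}\right) \end{aligned}$$

Define

$$\hat{X}(k) = \begin{cases} Y_D(k)\,e^{-j\frac{\pi k}{2N}}, & 0 \le k \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

The N-point DCT can then be expressed as $X(k) = e(k)\hat{X}(k)/2$.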
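The relation between the DCT and the 2N-point DFT of the mirrored sequence can be verified numerically. The sketch below evaluates the DFT by brute force with Python's `cmath`; the function names and the test data are assumptions made for this illustration.

```python
import cmath
import math

def e(k):
    """Scale factor e(k): 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct_direct(x):
    """Reference DCT straight from the definition."""
    N = len(x)
    return [e(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]

def dct_via_dft(x):
    """DCT through the 2N-point DFT of y(n) = x(n) followed by its mirror image."""
    N = len(x)
    y = list(x) + list(reversed(x))          # y(n) = x(2N-1-n) for n >= N
    X = []
    for k in range(N):
        # Brute-force 2N-point DFT sample Y_D(k).
        Y_D = sum(y[n] * cmath.exp(-2j * math.pi * k * n / (2 * N))
                  for n in range(2 * N))
        # X_hat(k) = Y_D(k) * exp(-j pi k / 2N); its imaginary part is ~0.
        X_hat = Y_D * cmath.exp(-1j * math.pi * k / (2 * N))
        X.append(e(k) * X_hat.real / 2.0)
    return X

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
X_ref = dct_direct(x)
X_dft = dct_via_dft(x)
```

In practice the DFT would be computed with an FFT; the brute-force sum is used here only to keep the sketch self-contained.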

N-point DCT/IDCT

• A direct N-point 1D-DCT requires $N^2$ multiplications and additions.
• For image compression, N×N blocks need an N×N 2D-DCT.

• Direct computation of a 2D-DCT of length N requires $N^4$ multiplications and additions.

• Using the separability of the 2D-DCT, it can be computed by performing N 1D-DCTs on the rows of the image block, followed by N 1D-DCTs on the resulting columns.

• The complexity is reduced to $2N^3$ multiply-add operations, or $4N^3$ arithmetic operations.

$$Y_{n_1,n_2} = \frac{2}{N}\,c(n_1)\,c(n_2)\sum_{k_1=0}^{N-1}\sum_{k_2=0}^{N-1} x_{k_1,k_2}\cos\frac{2\pi(2k_1+1)n_1}{4N}\cos\frac{2\pi(2k_2+1)n_2}{4N}$$

2D DCT

• The 2-D discrete cosine transform has been shown to be separable, i.e., it can be expressed as two consecutive 1-D transforms.

• Observe that in the 2-D case X and x are 2-D (N×N) data matrices. A 2-D transform can therefore be calculated by using a 1-D transform hardware unit twice, with a matrix transposition of the intermediate result in between.

$$\text{1-D DCT:}\quad X = A\,x \qquad\qquad \text{2-D DCT:}\quad X = A\,x\,A^{T}$$
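The row-column method can be sketched in a few lines: one 1-D routine is applied to the rows, the intermediate result is transposed, and the same routine is applied again. This is an illustrative sketch only; the matrix A is assumed here to hold the 1-D coefficients e(k)cos((2n+1)kπ/2N) from the definition at the start of the lecture, and all function names are hypothetical.

```python
import math

def dct_matrix(N):
    """A[k][n] = e(k) cos((2n+1) k pi / 2N), so that X = A x is the 1-D DCT."""
    return [[(1.0 / math.sqrt(2.0) if k == 0 else 1.0)
             * math.cos((2 * n + 1) * k * math.pi / (2 * N))
             for n in range(N)]
            for k in range(N)]

def dct_1d(A, v):
    """The single 1-D transform unit: returns A v."""
    return [sum(a * s for a, s in zip(row, v)) for row in A]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(x):
    """2-D DCT using the 1-D unit twice with a transposition in between."""
    A = dct_matrix(len(x))
    pass1 = [dct_1d(A, row) for row in x]                  # x A^T
    pass2 = [dct_1d(A, row) for row in transpose(pass1)]   # (A x A^T)^T
    return transpose(pass2)                                # A x A^T

def dct_2d_direct(x):
    """Reference: direct quadruple sum, N^4 multiply-adds."""
    N = len(x)
    A = dct_matrix(N)
    return [[sum(A[k1][n1] * x[n1][n2] * A[k2][n2]
                 for n1 in range(N) for n2 in range(N))
             for k2 in range(N)]
            for k1 in range(N)]

block = [[float((3 * r + 5 * c) % 7) for c in range(4)] for r in range(4)]
X_rc = dct_2d_row_column(block)
X_direct = dct_2d_direct(block)
```

The row-column version performs 2N one-dimensional transforms of N^2 multiply-adds each, which is where the 2N^3 figure comes from.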

Block diagram and timing diagram of DCT core processor

Algorithm-Architecture Transformation of DCT

• A hierarchical way to adapt an architecture to a given algorithm or change the algorithm’s description in a systematic way.

• The number of multiplications in the DCT can be reduced using this technique, e.g. for the 8-point DCT

$$y(k) = a_k\sum_{n=0}^{7} x(n)\cos\left(\frac{(2n+1)k\pi}{16}\right), \quad k = 0, 1, \ldots, 7$$

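Combining $a_k$ and the cosine expression into one coefficient $b_{n,k}$, we have the following dataflow graph.

We can write the dataflow graph in matrix form:

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} = \begin{bmatrix} c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 \\ c_1 & c_3 & c_5 & c_7 & c_9 & c_{11} & c_{13} & c_{15} \\ c_2 & c_6 & c_{10} & c_{14} & c_{18} & c_{22} & c_{26} & c_{30} \\ c_3 & c_9 & c_{15} & c_{21} & c_{27} & c_1 & c_7 & c_{13} \\ c_4 & c_{12} & c_{20} & c_{28} & c_4 & c_{12} & c_{20} & c_{28} \\ c_5 & c_{15} & c_{25} & c_3 & c_{13} & c_{23} & c_1 & c_{11} \\ c_6 & c_{18} & c_{30} & c_{10} & c_{22} & c_2 & c_{14} & c_{26} \\ c_7 & c_{21} & c_3 & c_{17} & c_{31} & c_{13} & c_{27} & c_9 \end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$

where

$$c_i = \cos\frac{i\pi}{16}$$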

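Transformation in 3 steps

• Step 1: systematically modify the DCT algorithm, here using trigonometric properties ($c_{32-i} = c_i$ and $c_{16\pm i} = -c_i$):

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} = \begin{bmatrix} c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 \\ c_1 & c_3 & c_5 & c_7 & -c_7 & -c_5 & -c_3 & -c_1 \\ c_2 & c_6 & -c_6 & -c_2 & -c_2 & -c_6 & c_6 & c_2 \\ c_3 & -c_7 & -c_1 & -c_5 & c_5 & c_1 & c_7 & -c_3 \\ c_4 & -c_4 & -c_4 & c_4 & c_4 & -c_4 & -c_4 & c_4 \\ c_5 & -c_1 & c_7 & c_3 & -c_3 & -c_7 & c_1 & -c_5 \\ c_6 & -c_2 & c_2 & -c_6 & -c_6 & c_2 & -c_2 & c_6 \\ c_7 & -c_5 & c_3 & -c_1 & c_1 & -c_3 & c_5 & -c_7 \end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$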


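• Then the 8-point DCT can be rewritten as

$$\begin{aligned} y(1) &= M_0c_1 + M_1c_7 + M_2c_3 + M_3c_5 \\ y(7) &= M_0c_7 - M_1c_1 - M_2c_5 + M_3c_3 \\ y(3) &= M_0c_3 - M_1c_5 - M_2c_7 - M_3c_1 \\ y(5) &= M_0c_5 + M_1c_3 - M_2c_1 + M_3c_7 \\ y(2) &= M_{10}c_2 + M_{11}c_6 \\ y(6) &= M_{10}c_6 - M_{11}c_2 \\ y(4) &= M_{100}c_4 \\ y(0) &= P_{100}c_4 \end{aligned}$$

where

$$\begin{aligned} M_0 &= x_0 - x_7, & P_0 &= x_0 + x_7 \\ M_1 &= x_3 - x_4, & P_1 &= x_3 + x_4 \\ M_2 &= x_1 - x_6, & P_2 &= x_1 + x_6 \\ M_3 &= x_2 - x_5, & P_3 &= x_2 + x_5 \\ M_{10} &= P_0 - P_1, & P_{10} &= P_0 + P_1 \\ M_{11} &= P_2 - P_3, & P_{11} &= P_2 + P_3 \\ M_{100} &= P_{10} - P_{11}, & P_{100} &= P_{10} + P_{11} \end{aligned}$$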
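The rewritten 8-point DCT can be checked against the original coefficient matrix. The sketch below assumes c_i = cos(iπ/16), the scale factor a_k folded into row 0 as c_4, and the sum/difference terms M_0 = x_0 - x_7, P_0 = x_0 + x_7, M_1 = x_3 - x_4, P_1 = x_3 + x_4, M_2 = x_1 - x_6, P_2 = x_1 + x_6, M_3 = x_2 - x_5, P_3 = x_2 + x_5 with their second and third levels. The signs are derived from the trigonometric symmetries rather than read from the slide, and the function names are illustrative.

```python
import math

# c[i] = cos(i*pi/16) for i = 0..7
c = [math.cos(i * math.pi / 16) for i in range(8)]

def dct8_matrix(x):
    """Reference: y(k) = sum_n x(n) cos((2n+1) k pi / 16), with row 0 scaled by c4."""
    y = [c[4] * sum(x)]
    for k in range(1, 8):
        y.append(sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                     for n in range(8)))
    return y

def dct8_decomposed(x):
    """8-point DCT after step 1, using shared sum (P) and difference (M) terms."""
    P0, M0 = x[0] + x[7], x[0] - x[7]
    P1, M1 = x[3] + x[4], x[3] - x[4]
    P2, M2 = x[1] + x[6], x[1] - x[6]
    P3, M3 = x[2] + x[5], x[2] - x[5]
    P10, M10 = P0 + P1, P0 - P1
    P11, M11 = P2 + P3, P2 - P3
    P100, M100 = P10 + P11, P10 - P11
    return [
        P100 * c[4],                                        # y(0)
        M0 * c[1] + M1 * c[7] + M2 * c[3] + M3 * c[5],      # y(1)
        M10 * c[2] + M11 * c[6],                            # y(2)
        M0 * c[3] - M1 * c[5] - M2 * c[7] - M3 * c[1],      # y(3)
        M100 * c[4],                                        # y(4)
        M0 * c[5] + M1 * c[3] - M2 * c[1] + M3 * c[7],      # y(5)
        M10 * c[6] - M11 * c[2],                            # y(6)
        M0 * c[7] - M1 * c[1] - M2 * c[5] + M3 * c[3],      # y(7)
    ]

x = [12.0, -3.0, 7.5, 0.0, 4.0, 9.0, -6.0, 1.0]
y_ref = dct8_matrix(x)
y_dec = dct8_decomposed(x)
```

In this form the multiplication count drops from 64 to 22 (1 + 1 + 2 + 2 + 4 × 4), before the block-level simplifications of the later steps.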



• Step 2 transformation: the DCT structure is grouped into different functional units represented by blocks, and the whole DCT structure is then transformed into a block diagram.

• Two major blocks:

– Butterfly block: inputs x(0) and x(1); outputs x(0)+x(1) and x(0)-x(1)

– Block with coefficients a and b: inputs x(0) and x(1); outputs ax(0)+bx(1) and bx(0)-ax(1)


• The transformed block diagram is:


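• Step 3: reduce the complexity of the implementations of the blocks.

• The block with coefficients a and b can be realized using 3 multiplications and 3 additions instead of 4 multiplications and 2 additions.

• Define the block with a = sin θ and b = cos θ, and with reversed outputs, as a rotator block that computes

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$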
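One well-known way to obtain a rotation with 3 multiplications and 3 additions is to share the product cos θ·(x + y) and precompute the constants cos θ + sin θ and sin θ - cos θ. The sketch below assumes the rotation x' = x cos θ - y sin θ, y' = x sin θ + y cos θ; it is one possible factorization, not necessarily the one drawn on the slide, and the function names are illustrative.

```python
import math

def rotate_4mult(x, y, theta):
    """Straightforward rotator: 4 multiplications, 2 additions."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_3mult(x, y, theta):
    """Same rotation with 3 multiplications and 3 additions.

    The constants (c + s) and (s - c) depend only on theta, so in hardware
    they would be precomputed and do not count as run-time additions.
    """
    c, s = math.cos(theta), math.sin(theta)
    c_plus_s = c + s
    s_minus_c = s - c
    t = c * (x + y)                # multiplication 1, addition 1
    x_rot = t - c_plus_s * y       # multiplication 2, addition 2 -> c*x - s*y
    y_rot = t + s_minus_c * x      # multiplication 3, addition 3 -> s*x + c*y
    return x_rot, y_rot

ref = rotate_4mult(2.0, -3.0, math.pi / 16)
fast = rotate_3mult(2.0, -3.0, math.pi / 16)
```

The trade is one multiplier for one adder, which is usually favorable in VLSI because a multiplier is far larger than an adder.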

Other transformations


• Final architecture

13 multiplications, 31 additions

Decimation-in-Frequency Fast DCT for $2^m$-Point IDCT

• Decimation in frequency (DIF) is commonly used for the DFT.

• It reduces the number of multiplications to about $(N/2)\log_2 N$ by power-of-2 decomposition.

• For simplicity the 2/N scaling factor is ignored. We define $\hat{X}(k) = e(k)X(k)$.

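Fast DCT/IDCT (FCT)

– Decomposing into even and odd indexes of k:

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)(2k)\pi}{2N}\right)}_{g(n)} + \underbrace{\sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right)}_{h(n)}$$

$$g(n) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)(2k)\pi}{2N}\right) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad \Rightarrow \quad N/2\text{-point IDCT}$$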

For h'(n) we use

$$2\cos\left(\frac{(2n+1)\pi}{2N}\right)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right) = \cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right)$$

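We have

$$\begin{aligned} h'(n) &\equiv 2\cos\left(\frac{(2n+1)\pi}{2N}\right)h(n) \\ &= \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right) \\ &= \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k-1)X(2k-1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) \\ &= \sum_{k=0}^{N/2-1}\left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)2k\pi}{2N}\right) \\ &= \sum_{k=0}^{N/2-1}\left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad \Rightarrow \quad N/2\text{-point IDCT} \end{aligned}$$

where $e(2k-1)X(2k-1)\big|_{k=0} = 0$. In the index shift of the second sum, the term at k = N/2 vanishes because $\cos\left(\frac{(2n+1)\pi}{2}\right) = 0$.

The first step uses 2cosAcosB = cos(A+B) + cos(A-B).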

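N-point IDCT can be decomposed using N/2-point IDCT

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT},\ g(n)} + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\underbrace{\sum_{k=0}^{N/2-1}\left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT},\ h'(n)}$$

$$\begin{aligned} x(n) &= g(n) + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\,h'(n) \\ x(N-1-n) &= g(n) - \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\,h'(n) \end{aligned} \qquad n = 0, 1, \ldots, N/2-1$$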

N-point IDCT Architecture

[Figure: N-point IDCT. The inputs X̂(0), X̂(1), ..., X̂(N-2), X̂(N-1) pass through an even-odd index mapping and an odd-summation stage. The even-indexed inputs X̂(0), X̂(2), ..., X̂(N-2) feed the N/2-point IDCT g(k); the sums of adjacent odd-indexed inputs feed the N/2-point IDCT h'(k). The h' outputs are scaled by 1/(2C), k: 0 ~ N/2-1, and combined with the g outputs by adders and subtractors; a re-order stage delivers x(0), ..., x(N/2-1), x(N/2), ..., x(N-1).]

• Since

$$\cos\frac{(2(N-1-n)+1)k\pi}{N} = \cos\frac{(2n+1)k\pi}{N}, \qquad \cos\frac{(2(N-1-n)+1)\pi}{2N} = -\cos\frac{(2n+1)\pi}{2N}$$

the N-point IDCT can be expressed in terms of two N/2-point IDCTs. By repeating this process, the IDCT can be decomposed further until it is expressed in terms of 2-point IDCTs. (The DCT can be decomposed in a similar fashion.)
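The recursive decomposition can be sketched as a short recursive routine and compared against the direct sum. The sketch assumes the unscaled form x(n) = Σ X̂(k)cos((2n+1)kπ/2N) with X̂(k) = e(k)X(k), a power-of-2 length, and X̂(-1) = 0; the function names are illustrative, and the recursion is carried down to length 1 rather than stopping at the 2-point butterfly.

```python
import math

def idct_direct(X_hat):
    """Reference: x(n) = sum_k X_hat(k) cos((2n+1) k pi / 2N), no 2/N scaling."""
    N = len(X_hat)
    return [sum(X_hat[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k in range(N))
            for n in range(N)]

def idct_fast(X_hat):
    """Decimation-in-frequency IDCT for N = 2^m via two N/2-point IDCTs."""
    N = len(X_hat)
    if N == 1:
        return [X_hat[0]]
    half = N // 2
    # Even-indexed inputs feed g; sums of adjacent odd-indexed inputs feed h'.
    G = [X_hat[2 * k] for k in range(half)]
    H = [X_hat[2 * k + 1] + (X_hat[2 * k - 1] if k > 0 else 0.0)
         for k in range(half)]
    g = idct_fast(G)
    h_prime = idct_fast(H)
    x = [0.0] * N
    for n in range(half):
        h = h_prime[n] / (2.0 * math.cos((2 * n + 1) * math.pi / (2 * N)))
        x[n] = g[n] + h            # x(n)     = g(n) + h'(n) / (2 cos(...))
        x[N - 1 - n] = g[n] - h    # x(N-1-n) = g(n) - h'(n) / (2 cos(...))
    return x

X_hat = [5.0, -2.0, 1.5, 0.0, 3.0, -4.0, 0.5, 2.0]
x_ref = idct_direct(X_hat)
x_fast = idct_fast(X_hat)
```

Each level of the recursion uses N/2 multiplications (the 1/(2cos) scalings), which gives the roughly (N/2)log2 N total quoted for the decimation-in-frequency approach.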

2-point IDCT butterfly architecture

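$$\begin{aligned} x(0) &= \hat{X}(0) + \hat{X}(1)\cos\frac{\pi}{4} \\ x(1) &= \hat{X}(0) - \hat{X}(1)\cos\frac{\pi}{4} \end{aligned}$$

[Figure: butterfly with inputs X̂(0) and X̂(1), a multiplier cos(π/4) on the X̂(1) branch, a -1 branch, and outputs x(0) and x(1).]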

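E.g. 8-point IDCT

$$G(n) = \hat{X}(2n), \qquad H(n) = \hat{X}(2n+1) + \hat{X}(2n-1), \qquad n = 0, 1, 2, 3$$

$$g(k) = \sum_{n=0}^{3} G(n)\cos\frac{(2k+1)n\pi}{2(N/2)}, \qquad h'(k) = \sum_{n=0}^{3} H(n)\cos\frac{(2k+1)n\pi}{2(N/2)}$$

$$\begin{aligned} x(k) &= g(k) + \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h'(k) \\ x(7-k) &= g(k) - \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h'(k) \end{aligned} \qquad k = 0, 1, 2, 3$$

with N = 8.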

8-point IDCT architecture

Complexity comparison

Multiplier-less DCT architecture

• Uses distributed arithmetic
• More area-efficient realization of hardware
• Replacement of multipliers by memory look-up tables
• Regularity of the highly concurrent structure allows modular design of the circuit
• Bit-serial and bit-parallel structure: saves area and eases routing

Distributed Arithmetic (B. Liu -74)

• The most-often encountered form of computation in DSP:

– Sum of product

– Dot-product

– Inner-product

• Distributed arithmetic (DA) is used to design bit-level architectures for vector-vector multiplications (inner products)

– Each word in the vectors is represented as a binary number

– The multiplications are re-ordered and mixed such that the arithmetic becomes “distributed” through the structure

Technical Overview of DA

• Advantage of DA: efficiency of the computing mechanization

• A frequently argued disadvantage:

– Slowness because of its inherent bit-serial nature

– Some modifications increase the speed, but at the cost of more arithmetic operations or of exponentially increased memory

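Conventional distributed arithmetic

• An inner product between two length-N vectors C and X:

$$Y = \sum_{i=0}^{N-1} c_i x_i$$

• where the {c_i} are M-bit constants and the {x_i} are coded as W-bit 2's complement numbers as follows:

$$x_i = -x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}$$

– Now substituting the above equation, we have

$$Y = \sum_{i=0}^{N-1} c_i\left(-x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}\right) = -\sum_{i=0}^{N-1} c_i x_{i,W-1} + \sum_{j=1}^{W-1}\left(\sum_{i=0}^{N-1} c_i x_{i,W-1-j}\right)2^{-j}$$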

Conventional distributed arithmetic

• Define

$$C_{W-1-j} = \sum_{i=0}^{N-1} c_i x_{i,W-1-j} \quad (j \ne 0), \qquad C_{W-1} = -\sum_{i=0}^{N-1} c_i x_{i,W-1}$$

• Then

$$Y = \sum_{j=0}^{W-1} C_{W-1-j}\,2^{-j}$$

• By interchanging the summing order of i and j, the initial multiplications are now distributed to another computation pattern.

• Since the term $C_j$ depends on the $x_{i,j}$ values and has only $2^N$ possible values, it is possible to pre-compute them and store them in a ROM.

• An input set of N bits $(x_{0,j}, x_{1,j}, \ldots, x_{N-1,j})$ is used as an address to get the $C_j$ value.

• These intermediate results are accumulated in W clock cycles to produce one Y value.

Example Content of ROM (N=4)

Architecture of computing inner product of two length-N vectors using DA

The result is obtained after W clock cycles. This is called bit-serial distributed arithmetic. Its speed is limited because each output takes W cycles.
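The bit-serial scheme can be modeled in software as a ROM lookup followed by a shift-accumulate step per clock cycle. The sketch below is a behavioral model under stated assumptions: the inputs are W-bit two's complement integers q_i representing x_i = q_i / 2^(W-1), bits are processed LSB first, and the names `build_rom` and `da_inner_product` as well as the example coefficients are hypothetical.

```python
def build_rom(c):
    """ROM with 2^N words: word at address a holds the sum of c[i] over set bits i of a."""
    N = len(c)
    return [sum(c[i] for i in range(N) if (addr >> i) & 1)
            for addr in range(2 ** N)]

def da_inner_product(c, q, W):
    """Bit-serial DA model: one ROM access and one shift-accumulate per cycle.

    q[i] is a W-bit two's complement integer; it represents x_i = q[i] / 2**(W-1).
    """
    rom = build_rom(c)
    mask = 2 ** W - 1
    acc = 0.0
    for j in range(W):                       # cycle j processes bit j (LSB first)
        addr = 0
        for i, qi in enumerate(q):
            bit = ((qi & mask) >> j) & 1     # bit j of the two's complement word
            addr += bit * 2 ** i             # bit of x_i drives address line i
        term = rom[addr]
        if j == W - 1:
            term = -term                     # the sign-bit cycle is subtracted
        acc = acc / 2.0 + term               # shift right, then accumulate
    return acc

coeffs = [0.5, -0.25, 0.125, 1.0]            # illustrative constants, N = 4
q = [3, -4, 7, -8]                           # 4-bit inputs, x_i = q_i / 8
Y_da = da_inner_product(coeffs, q, 4)
Y_ref = sum(ci * qi / 8.0 for ci, qi in zip(coeffs, q))
```

For N = 4 the ROM holds 16 words, one per combination of input bits; no multiplier appears anywhere in the loop.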

Speeding up bit-serial DA

• Use digit-serial distributed arithmetic, where a digit containing multiple bits is processed in a clock cycle

• E.g. if J consecutive bits are processed in a single clock cycle using J ROMs, then the input words are processed in W/J clock cycles.

• A multi-input shift-accumulator adds the contents of J ROMs and the previous accumulated results

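DA with Offset-Binary Coding

• Offset-binary coding (OBC) can be used to reduce the ROM size by a factor of 2.

$$x_i = \frac{1}{2}\left[x_i - (-x_i)\right] = \frac{1}{2}\left[-(x_{i,W-1} - \bar{x}_{i,W-1}) + \sum_{j=1}^{W-1}(x_{i,W-1-j} - \bar{x}_{i,W-1-j})\,2^{-j} - 2^{-(W-1)}\right] \quad \text{(eqn. 1)}$$

Where

$$-x_i = -\bar{x}_{i,W-1} + \sum_{j=1}^{W-1}\bar{x}_{i,W-1-j}\,2^{-j} + 2^{-(W-1)}$$

Define

$$d_{i,j} = \begin{cases} x_{i,j} - \bar{x}_{i,j}, & \text{for } j \ne W-1 \\ -(x_{i,W-1} - \bar{x}_{i,W-1}), & \text{for } j = W-1 \end{cases} \qquad d_{i,j} \in \{-1, 1\}$$

Eqn. 1 can be rewritten as

$$x_i = \frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] \quad \text{(eqn. 2)}$$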

DA with Offset-Binary Coding

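Using eqn. 2, the original Y can be written as

$$Y = \sum_{i=0}^{N-1} c_i\,\frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] = \sum_{j=0}^{W-1}\left(\sum_{i=0}^{N-1}\frac{1}{2}c_i d_{i,W-1-j}\right)2^{-j} - \left(\frac{1}{2}\sum_{i=0}^{N-1} c_i\right)2^{-(W-1)}$$

Now define

$$D_j = \sum_{i=0}^{N-1}\frac{1}{2}c_i d_{i,j} \quad \text{for } 0 \le j \le W-1, \qquad \text{and} \qquad D_{extra} = -\frac{1}{2}\sum_{i=0}^{N-1} c_i$$

We have

$$Y = \sum_{j=0}^{W-1} D_{W-1-j}\,2^{-j} + D_{extra}\,2^{-(W-1)}$$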

Content of the ROM with OBC Coding (N=4)

• Table 13.3: the $D_j$ values are mirrored (complementing every address bit only negates the stored value), so only $2^{N-1}$ values need to be stored, depending on the $x_{i,j}$ values, and the ROM size is reduced by a factor of 2.
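The mirror property can be illustrated with a behavioral model that stores only half of the table and negates the output when the address is complemented. The sketch assumes D_j = Σ ½ c_i d_{i,j} with d_{i,j} = +1 for a 1 bit and -1 for a 0 bit (sign reversed in the sign-bit cycle), D_extra = -½ Σ c_i added with the LSB weight, and W-bit two's complement inputs x_i = q_i / 2^(W-1). Using the bit of x_0 to select the half is a choice made here for illustration; the function names and data are hypothetical.

```python
def build_obc_half_rom(c):
    """Half-size OBC ROM: entries for addresses whose x_0 bit is 0 (d_0 = -1)."""
    N = len(c)
    rom = []
    for addr in range(2 ** (N - 1)):
        total = -c[0]
        for i in range(1, N):
            d = 1 if (addr >> (i - 1)) & 1 else -1
            total += c[i] * d
        rom.append(0.5 * total)
    return rom

def obc_lookup(rom, bits):
    """Return D for the N input bits, using the mirror property D(~bits) = -D(bits)."""
    N = len(bits)
    addr = sum(bits[i] * 2 ** (i - 1) for i in range(1, N))
    if bits[0] == 0:
        return rom[addr]
    return -rom[addr ^ (2 ** (N - 1) - 1)]   # complement the address, negate output

def da_obc_inner_product(c, q, W):
    """Bit-serial DA with offset-binary coding (behavioral model)."""
    rom = build_obc_half_rom(c)
    d_extra = -0.5 * sum(c)
    mask = 2 ** W - 1
    acc = 0.0
    for j in range(W):                        # LSB first
        bits = [((qi & mask) >> j) & 1 for qi in q]
        term = obc_lookup(rom, bits)
        if j == W - 1:
            term = -term                      # sign-bit cycle: d has reversed sign
        if j == 0:
            term += d_extra                   # D_extra carries the LSB weight
        acc = acc / 2.0 + term
    return acc

coeffs = [0.5, -0.25, 0.125, 1.0]             # illustrative constants, N = 4
q = [3, -4, 7, -8]                            # 4-bit inputs, x_i = q_i / 8
Y_obc = da_obc_inner_product(coeffs, q, 4)
Y_ref = sum(ci * qi / 8.0 for ci, qi in zip(coeffs, q))
```

For N = 4 the table shrinks from 16 to 8 words; the price is an XOR on the address lines and a conditional negation at the ROM output.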

Architecture with OBC coding

ROM decomposition for DA

• ROM size increases exponentially with N

– ROM access time can be a bottleneck, especially when N is large
– Reducing the size of the ROM is important

• Solution

– Divide the N address bits into N/K groups of K bits
– Decompose the ROM of size $2^N$ into N/K ROMs of size $2^K$
– Add the outputs of these ROMs using a multi-input accumulator
– The reduction of the storage size is balanced by a linear increase in the computational complexity of the accumulator
– Carry-save arithmetic can be used to realize the multi-input accumulator to minimize the computation time
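The storage saving from ROM decomposition can be made concrete with a small model that builds the partial ROMs and checks that their summed outputs equal the entry of the full ROM for every address. The sketch assumes N is a multiple of K and plain (non-OBC) ROM contents; the function names and coefficients are hypothetical.

```python
import math

def build_rom(c):
    """Full DA ROM: word at address a is the sum of c[i] over the set bits i of a."""
    return [sum(ci for i, ci in enumerate(c) if (addr >> i) & 1)
            for addr in range(2 ** len(c))]

def build_decomposed_roms(c, K):
    """Split the N coefficients into N/K groups and build one 2^K-word ROM per group."""
    if len(c) % K != 0:
        raise ValueError("N must be a multiple of K in this sketch")
    return [build_rom(c[g:g + K]) for g in range(0, len(c), K)]

def lookup_decomposed(roms, addr, K):
    """Multi-input accumulation: add the outputs of the N/K partial ROMs."""
    mask = 2 ** K - 1
    return sum(rom[(addr >> (K * g)) & mask] for g, rom in enumerate(roms))

coeffs = [0.5, -0.25, 0.125, 1.0, -0.75, 0.375, 0.0625, -1.5]   # N = 8
full_rom = build_rom(coeffs)                                     # 2^8 = 256 words
partial_roms = build_decomposed_roms(coeffs, 4)                  # 2 ROMs of 2^4 = 16 words
total_words = sum(len(r) for r in partial_roms)
all_match = all(math.isclose(full_rom[a], lookup_decomposed(partial_roms, a, 4), abs_tol=1e-12)
                for a in range(2 ** 8))
```

With N = 8 and K = 4 the storage drops from 256 words to 32, at the cost of one extra addition per lookup.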

Multi-input accumulator

CPA: carry-propagate adder
CSA: carry-save adder

[Figure: three multi-input accumulator structures with Delay = N·Tfa, Delay = 4·Tfa, and Delay = 3·Tfa; the last one needs more registers.]

Architecture with ROM decomposition

Conclusion on DA

• DA is a very efficient mechanism for computations that are dominated by inner products (e.g. convolution).

• It is a good way to trade combinational logic for memory in high-performance computation.

• When many computing methods are compared, DA should be considered. It is often (though not always) the best, and it never performs poorly: it saves around 50% to 80% of the gate count.

• Application: “VLSI implementation of a 16*16 discrete cosine transform,” by M.-T. Sun, T.-C. Chen, A. M. Gottlieb, IEEE Transactions on Circuits and Systems, Volume 36, Issue 4, April 1989, pages 610–617, and many other transforms and DSP kernels.

DCT architecture using DA

For small size DCT, we can use combinational logic (CB) to implement the ROM. This will reduce the critical path delay
