ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform


Page 1: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

ELEC692 VLSI Signal Processing Architecture

Lecture 9: VLSI Architecture for Discrete Cosine Transform

Page 2: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Discrete Cosine Transform

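• Frequency transform

• Used for pattern recognition, image processing, still and moving image and video processing

• For an N-point sequence x(n), the N-point DCT and IDCT pair is defined as

$$X(k) = e(k)\sum_{n=0}^{N-1} x(n)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1$$

$$x(n) = \frac{2}{N}\sum_{k=0}^{N-1} e(k)\,X(k)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad n = 0, 1, \ldots, N-1$$

where

$$e(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k = 0 \\ 1, & \text{otherwise} \end{cases}$$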
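The DCT/IDCT pair above can be checked numerically. The following is a minimal sketch in plain Python that evaluates both sums directly; the function names `e`, `dct`, and `idct` are chosen here for illustration and are not part of the lecture.

```python
import math

def e(k):
    """Scale factor e(k): 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct(x):
    """N-point DCT: X(k) = e(k) * sum_n x(n) cos((2n+1) k pi / 2N)."""
    N = len(x)
    return [e(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
            for k in range(N)]

def idct(X):
    """N-point IDCT: x(n) = (2/N) * sum_k e(k) X(k) cos((2n+1) k pi / 2N)."""
    N = len(X)
    return [(2.0 / N) * sum(e(k) * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for k in range(N))
            for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
X = dct(x)
x_back = idct(X)  # should reproduce x up to rounding error
```

Each output costs N multiply-adds, so this direct form needs N^2 operations in total, which is the baseline the fast algorithms later in the lecture improve on.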

Page 3: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

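N-point DCT/IDCT

• The N-point DCT and IDCT pair can be derived using a 2N-point discrete Fourier transform (DFT) pair, using x(n) and its mirror image (with x(n) taken as zero outside 0 ≤ n ≤ N−1):

$$y(n) = x(n) + x(2N-n-1) = \begin{cases} x(n), & 0 \le n \le N-1 \\ x(2N-n-1), & N \le n \le 2N-1 \end{cases}$$

y(n) is symmetric with respect to the midpoint at n = N − 1/2. The 2N-point DFT of y(n) is given by (for 0 ≤ k ≤ 2N−1)

$$Y_D(k) = \sum_{n=0}^{2N-1} y(n)\,e^{-j\frac{2\pi}{2N}kn} = \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{2N}kn} + \sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi}{2N}kn}$$

Substituting n = 2N − n' − 1 into the second summation, we have

$$\sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi}{2N}kn} = \sum_{n'=N-1}^{0} x(n')\,e^{-j\frac{2\pi}{2N}k(2N-n'-1)} = \sum_{n'=0}^{N-1} x(n')\,e^{j\frac{2\pi}{2N}kn'}\,e^{j\frac{2\pi}{2N}k}$$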

Page 4: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

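N-point DCT (cont.)

• Now we have

$$\begin{aligned}
Y_D(k) &= \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{2N}kn} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{2\pi}{2N}kn}\,e^{j\frac{2\pi}{2N}k} \\
&= e^{j\frac{\pi k}{2N}}\left(\sum_{n=0}^{N-1} x(n)\,e^{-j\frac{(2n+1)k\pi}{2N}} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{(2n+1)k\pi}{2N}}\right) \\
&= e^{j\frac{\pi k}{2N}}\sum_{n=0}^{N-1} 2\,x(n)\cos\left(\frac{(2n+1)k\pi}{2N}\right)
\end{aligned}$$

Define

$$\hat{X}(k) = \begin{cases} Y_D(k)\,e^{-j\frac{k\pi}{2N}}, & 0 \le k \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

The N-point DCT can then be expressed as $X(k) = e(k)\,\hat{X}(k)/2$.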
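The relation between the DCT and the 2N-point DFT of the mirrored sequence can be sketched as follows. This is an illustrative check only: the DFT is evaluated naively with `cmath` rather than with an FFT, and the names `dct_direct` and `dct_via_dft` are assumptions of this sketch.

```python
import cmath
import math

def dct_direct(x):
    """Reference DCT: X(k) = e(k) * sum_n x(n) cos((2n+1) k pi / 2N)."""
    N = len(x)
    out = []
    for k in range(N):
        ek = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        out.append(ek * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                            for n in range(N)))
    return out

def dct_via_dft(x):
    """DCT computed from the 2N-point DFT of y(n) = x(n) followed by its mirror image."""
    N = len(x)
    y = list(x) + list(reversed(x))          # y(n) = x(2N-n-1) for n >= N
    out = []
    for k in range(N):
        # 2N-point DFT sample Y_D(k), evaluated directly
        Yd = sum(y[n] * cmath.exp(-2j * math.pi * k * n / (2 * N))
                 for n in range(2 * N))
        # Remove the half-sample phase term: Xhat(k) = Y_D(k) * exp(-j k pi / 2N)
        Xhat = Yd * cmath.exp(-1j * math.pi * k / (2 * N))
        ek = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        out.append(ek * Xhat.real / 2.0)     # X(k) = e(k) * Xhat(k) / 2
    return out

x = [1.0, -2.0, 3.5, 0.25, 4.0, -1.0, 2.0, 0.5]
A = dct_direct(x)
B = dct_via_dft(x)
```

After the phase correction, the imaginary part of `Xhat` is zero up to rounding, which is why only the real part is kept.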

Page 5: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

N-point DCT/IDCT

• An N-point 1-D DCT requires $N^2$ multiplications and additions.

• For image compression, N × N blocks need an N × N 2-D DCT:

$$Y_{k_1,k_2} = \frac{2}{N}\,c(k_1)\,c(k_2)\sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1} x_{n_1,n_2}\cos\frac{2\pi(2n_1+1)k_1}{4N}\cos\frac{2\pi(2n_2+1)k_2}{4N}$$

where $c(k) = 1/\sqrt{2}$ for $k = 0$ and $c(k) = 1$ otherwise.

• Direct computation of a 2-D DCT of length N requires $N^4$ multiplications and additions.

• Using the separability of the 2-D DCT, it can be computed by performing N 1-D DCTs on the rows of the image block followed by N 1-D DCTs on the resulting columns.

• Complexity is reduced to $2N^3$ multiply-add operations, or $4N^3$ arithmetic operations.

Page 6: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

2D DCT

• The 2-D Discrete Cosine Transform has been shown to be separable, i.e., it can be expressed as two consecutive 1-D transforms:

$$\text{1-D DCT: } X = A\,x \qquad\qquad \text{2-D DCT: } X = A\,x\,A^{T}$$

• Observe that in the 2-D case X and x are 2-D (N × N) data matrices. A 2-D transform can therefore be calculated by using a 1-D transform hardware unit twice, with a matrix transposition of the intermediate result in between.
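The row-column idea can be sketched in a few lines of Python. Here `dct_matrix` builds the 1-D coefficient matrix A with entries e(k)·cos((2n+1)kπ/2N), `dct2_row_column` applies the same 1-D pass twice with a transposition in between, and `dct2_direct` is the four-loop reference. All names are illustrative, and the overall 2/N scale factor is left out as an assumption of this sketch.

```python
import math

def dct_matrix(N):
    """A[k][n] = e(k) * cos((2n+1) k pi / 2N)."""
    rows = []
    for k in range(N):
        ek = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        rows.append([ek * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                     for n in range(N)])
    return rows

def matmul(P, Q):
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dct2_row_column(x):
    """Two 1-D passes with a transposition in between: result is A x A^T."""
    A = dct_matrix(len(x))
    first = matmul(A, x)                    # first 1-D pass: A x
    second = matmul(A, transpose(first))    # same unit on the transposed data
    return transpose(second)                # (A (A x)^T)^T = A x A^T

def dct2_direct(x):
    """Direct evaluation of the double sum: N^4 multiply-adds."""
    N = len(x)
    A = dct_matrix(N)
    return [[sum(A[k1][n1] * x[n1][n2] * A[k2][n2]
                 for n1 in range(N) for n2 in range(N))
             for k2 in range(N)]
            for k1 in range(N)]

block = [[1.0, 2.0, 3.0, 4.0],
         [0.0, -1.0, 2.5, 1.0],
         [4.0, 4.0, -2.0, 0.5],
         [3.0, 0.0, 1.0, -1.5]]
Y_rc = dct2_row_column(block)
Y_direct = dct2_direct(block)
```

The row-column version performs two N×N matrix products (2N^3 multiply-adds), against N^4 for the direct double sum.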

Page 7: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Block diagram and timing diagram of DCT core processor

Page 8: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

• A hierarchical way to adapt an architecture to a given algorithm or change the algorithm’s description in a systematic way.

• The number of multiplications in the DCT can be reduced using this technique, e.g. for the 8-point DCT

$$y(k) = a_k\sum_{n=0}^{7} x(n)\cos\left(\frac{(2n+1)k\pi}{16}\right), \quad k = 0, 1, \ldots, 7$$

Combining $a_k$ and the cosine expression into one coefficient $b_{n,k}$, we have the following dataflow graph

Page 9: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

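We can write the dataflow graph in matrix form

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} =
\begin{bmatrix}
c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 \\
c_1 & c_3 & c_5 & c_7 & c_9 & c_{11} & c_{13} & c_{15} \\
c_2 & c_6 & c_{10} & c_{14} & c_{18} & c_{22} & c_{26} & c_{30} \\
c_3 & c_9 & c_{15} & c_{21} & c_{27} & c_1 & c_7 & c_{13} \\
c_4 & c_{12} & c_{20} & c_{28} & c_4 & c_{12} & c_{20} & c_{28} \\
c_5 & c_{15} & c_{25} & c_3 & c_{13} & c_{23} & c_1 & c_{11} \\
c_6 & c_{18} & c_{30} & c_{10} & c_{22} & c_2 & c_{14} & c_{26} \\
c_7 & c_{21} & c_3 & c_{17} & c_{31} & c_{13} & c_{27} & c_9
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$

where

$$c_i = \cos\frac{i\pi}{16}$$

Transformation in 3 steps

1st step: systematically modify the DCT algorithm, here using trigonometric properties

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} =
\begin{bmatrix}
c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 \\
c_1 & c_3 & c_5 & c_7 & -c_7 & -c_5 & -c_3 & -c_1 \\
c_2 & c_6 & -c_6 & -c_2 & -c_2 & -c_6 & c_6 & c_2 \\
c_3 & -c_7 & -c_1 & -c_5 & c_5 & c_1 & c_7 & -c_3 \\
c_4 & -c_4 & -c_4 & c_4 & c_4 & -c_4 & -c_4 & c_4 \\
c_5 & -c_1 & c_7 & c_3 & -c_3 & -c_7 & c_1 & -c_5 \\
c_6 & -c_2 & c_2 & -c_6 & -c_6 & c_2 & -c_2 & c_6 \\
c_7 & -c_5 & c_3 & -c_1 & c_1 & -c_3 & c_5 & -c_7
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$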

Page 10: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

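• Then the 8-point DCT can be rewritten as

$$\begin{aligned}
y(1) &= M_0 c_1 + M_1 c_7 + M_2 c_3 + M_3 c_5 \\
y(7) &= M_0 c_7 - M_1 c_1 - M_2 c_5 + M_3 c_3 \\
y(3) &= M_0 c_3 - M_1 c_5 - M_2 c_7 - M_3 c_1 \\
y(5) &= M_0 c_5 + M_1 c_3 - M_2 c_1 + M_3 c_7 \\
y(2) &= M_{10} c_2 + M_{11} c_6 \\
y(6) &= M_{10} c_6 - M_{11} c_2 \\
y(4) &= M_{100} c_4 \\
y(0) &= P_{100} c_4
\end{aligned}$$

where

$$\begin{aligned}
M_0 &= x(0) - x(7), & P_0 &= x(0) + x(7) \\
M_1 &= x(3) - x(4), & P_1 &= x(3) + x(4) \\
M_2 &= x(1) - x(6), & P_2 &= x(1) + x(6) \\
M_3 &= x(2) - x(5), & P_3 &= x(2) + x(5) \\
M_{10} &= P_0 - P_1, & P_{10} &= P_0 + P_1 \\
M_{11} &= P_2 - P_3, & P_{11} &= P_2 + P_3 \\
M_{100} &= P_{10} - P_{11}, & P_{100} &= P_{10} + P_{11}
\end{aligned}$$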
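The sum/difference rewriting can be checked against the full 8×8 coefficient matrix. The sketch below is illustrative: `dct8_reference` evaluates y(k) = Σ c_{(2n+1)k mod 32}·x(n) with the k = 0 row set to c_4, and `dct8_sum_difference` uses the M and P terms. The signs in the second function are derived here from the symmetries c_{16+i} = −c_i and c_{32−i} = c_i, so they should be read as this sketch's assumption rather than as a transcription of the slide.

```python
import math

# c_i = cos(i * pi / 16), tabulated over one full period of 32
C = [math.cos(i * math.pi / 16) for i in range(32)]

def dct8_reference(x):
    """Full matrix-vector product: 64 multiplications."""
    y = [C[4] * sum(x)]                      # row 0 uses c_4 for every column
    for k in range(1, 8):
        y.append(sum(C[((2 * n + 1) * k) % 32] * x[n] for n in range(8)))
    return y

def dct8_sum_difference(x):
    """Same outputs from input sums (P) and differences (M): 22 multiplications."""
    M0, P0 = x[0] - x[7], x[0] + x[7]
    M1, P1 = x[3] - x[4], x[3] + x[4]
    M2, P2 = x[1] - x[6], x[1] + x[6]
    M3, P3 = x[2] - x[5], x[2] + x[5]
    M10, P10 = P0 - P1, P0 + P1
    M11, P11 = P2 - P3, P2 + P3
    M100, P100 = P10 - P11, P10 + P11
    c = C
    return [
        P100 * c[4],                                          # y(0)
        M0 * c[1] + M1 * c[7] + M2 * c[3] + M3 * c[5],        # y(1)
        M10 * c[2] + M11 * c[6],                              # y(2)
        M0 * c[3] - M1 * c[5] - M2 * c[7] - M3 * c[1],        # y(3)
        M100 * c[4],                                          # y(4)
        M0 * c[5] + M1 * c[3] - M2 * c[1] + M3 * c[7],        # y(5)
        M10 * c[6] - M11 * c[2],                              # y(6)
        M0 * c[7] - M1 * c[1] - M2 * c[5] + M3 * c[3],        # y(7)
    ]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_ref = dct8_reference(x)
y_fast = dct8_sum_difference(x)
```

Counting the products in the return list gives 16 for the odd outputs, 4 for y(2) and y(6), and 2 for y(0) and y(4), which is the saving this first transformation step is after.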

Page 11: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

Page 12: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

• Step 2 transformation: DCT structure is grouped into different functional units represented by blocks and then the whole DCT structure is transformed into a block diagram.

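• Two major blocks:

– Butterfly block: inputs x(0) and x(1); outputs x(0) + x(1) and x(0) − x(1)

– Multiplier block with coefficients a and b: inputs x(0) and x(1); outputs a·x(0) + b·x(1) and b·x(0) − a·x(1)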

Page 13: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

• The transformed block diagram is:

Page 14: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

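• Step 3: reduce the complexity of the implementations of the blocks.

• The multiplier block can be realized using 3 multiplications and 3 additions instead of 4 multiplications and 2 additions.

• Define the block with a = sin θ and b = cos θ, and with reversed outputs, as a rotator block that computes

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$$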
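One common way to obtain a 3-multiplication rotator is to share the product cos θ·(x + y) between the two outputs. The sketch below assumes the rotation convention x' = x·cos θ − y·sin θ, y' = x·sin θ + y·cos θ and assumes that the two constants (cos θ + sin θ) and (sin θ − cos θ) are precomputed, as they would be for fixed DCT coefficients. It is one possible factorization, not necessarily the one drawn on the slide.

```python
import math

def rotate_4mul(x, y, theta):
    """Direct form: 4 multiplications, 2 additions."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_3mul(x, y, theta):
    """Shared-product form: 3 multiplications, 3 additions."""
    c, s = math.cos(theta), math.sin(theta)
    k_sum = c + s            # precomputed constant (not counted at run time)
    k_diff = s - c           # precomputed constant (not counted at run time)
    z = c * (x + y)          # multiplication 1, addition 1
    xr = z - k_sum * y       # multiplication 2, addition 2: c*x - s*y
    yr = z + k_diff * x      # multiplication 3, addition 3: s*x + c*y
    return xr, yr

a = rotate_4mul(1.5, -2.0, 0.3)
b = rotate_3mul(1.5, -2.0, 0.3)
```

Trading one multiplier for one adder is attractive in hardware because a multiplier is much larger than an adder of the same word length.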

Other transformations

Page 15: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Algorithm-Architecture Transformation of DCT

• Final architecture

13 multiplications, 31 additions

Page 16: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

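Decimation-in-Frequency Fast DCT for $2^m$-Point IDCT

• DIF is commonly used in the DFT.

• It reduces the number of multiplications to about $(N/2)\log_2 N$ by power-of-2 decomposition.

• For simplicity, the 2/N scaling factor is ignored. We have $\hat{X}(k) = e(k)\,X(k)$, so the IDCT to be computed is $x(n) = \sum_{k=0}^{N-1}\hat{X}(k)\cos\frac{(2n+1)k\pi}{2N}$.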

Page 17: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

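Fast DCT/IDCT (FCT)

– Decomposing into even and odd indexes of k

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)2k\pi}{2N}\right)}_{g(n)} + \underbrace{\sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right)}_{h(n)}$$

$$g(n) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad\Rightarrow\quad N/2\text{-point IDCT}$$

For h'(n) we use the identity 2cosAcosB = cos(A+B) + cos(A−B):

$$2\cos\left(\frac{(2n+1)\pi}{2N}\right)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right) = \cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right)$$

We have

$$\begin{aligned}
h'(n) &\equiv 2\cos\left(\frac{(2n+1)\pi}{2N}\right)h(n) \\
&= \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right) \\
&= \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k-1)X(2k-1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) \\
&= \sum_{k=0}^{N/2-1} \left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)2k\pi}{2N}\right) \\
&= \sum_{k=0}^{N/2-1} \left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad\Rightarrow\quad N/2\text{-point IDCT}
\end{aligned}$$

where $e(2k-1)X(2k-1)\big|_{k=0} = 0$. The index shift in the third line replaces k+1 by k; the term at k = N/2 vanishes because $\cos\frac{(2n+1)\pi}{2} = 0$.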

Page 18: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

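N-point IDCT can be decomposed using N/2-point IDCT

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT, } g(n)} + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\underbrace{\sum_{k=0}^{N/2-1}\left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT, } h'(n)}$$

$$x(n) = g(n) + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\,h'(n)$$

$$x(N-1-n) = g(n) - \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\,h'(n)$$

for n = 0, 1, …, N/2 − 1.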

Page 19: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

N-point IDCT Architecture

[Figure: N-point IDCT architecture. Blocks: Even-Odd Index Mapping, Odd Summation, N/2-point IDCT producing g(k), N/2-point IDCT producing h'(k), multiplication by coefficients C_k (k = 0 ~ N/2 − 1), output adders, and Re-order. Inputs X(0), X(1), …, X(N−1); outputs x(0), x(1), …, x(N−1).]

Page 20: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

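• Since

$$\cos\frac{(2(N-1-n)+1)k\pi}{N} = \cos\frac{(2n+1)k\pi}{N}, \qquad \cos\frac{(2(N-1-n)+1)\pi}{2N} = -\cos\frac{(2n+1)\pi}{2N}$$

the N-point IDCT can be expressed in terms of two N/2-point IDCTs. By repeating this process, the IDCT can be decomposed further until it is expressed in terms of 2-point IDCTs (the DCT can be decomposed in a similar fashion).

2-point IDCT butterfly architecture

$$x(0) = \hat{X}(0) + \hat{X}(1)\cos\frac{\pi}{4}$$

$$x(1) = \hat{X}(0) - \hat{X}(1)\cos\frac{\pi}{4}$$

[Figure: 2-point IDCT butterfly with inputs X̂(0) and X̂(1), a cos(π/4) multiplier, a −1 branch, and outputs x(0) and x(1).]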

Page 21: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

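E.g. 8-point IDCT (N = 8)

$$G(n) = \hat{X}(2n), \qquad H(n) = \hat{X}(2n+1) + \hat{X}(2n-1), \quad n = 0, 1, 2, 3$$

$$g(k) = \sum_{n=0}^{3} G(n)\cos\frac{(2k+1)n\pi}{2(N/2)}, \qquad h'(k) = \sum_{n=0}^{3} H(n)\cos\frac{(2k+1)n\pi}{2(N/2)}$$

$$x(k) = g(k) + \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h'(k)$$

$$x(7-k) = g(k) - \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h'(k), \quad k = 0, 1, 2, 3$$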
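The decomposition can be written as a short recursive routine. This is a sketch under the assumptions used in this part of the lecture: the input is the pre-scaled sequence X̂(k) = e(k)X(k), the 2/N factor is ignored, N is a power of two, and X̂(−1) is taken as 0. The names `idct_direct` and `idct_fast` are illustrative.

```python
import math

def idct_direct(Xh):
    """Reference: x(n) = sum_k Xhat(k) cos((2n+1) k pi / 2N), no 2/N scaling."""
    N = len(Xh)
    return [sum(Xh[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k in range(N))
            for n in range(N)]

def idct_fast(Xh):
    """Recursive decimation-in-frequency IDCT for N = 2^m."""
    N = len(Xh)
    if N == 1:
        return [Xh[0]]
    half = N // 2
    # Even-odd index mapping and odd summation, with Xhat(-1) = 0
    G = [Xh[2 * k] for k in range(half)]
    H = [Xh[2 * k + 1] + (Xh[2 * k - 1] if k > 0 else 0.0) for k in range(half)]
    g = idct_fast(G)      # N/2-point IDCT giving g(n)
    hp = idct_fast(H)     # N/2-point IDCT giving h'(n)
    x = [0.0] * N
    for n in range(half):
        t = hp[n] / (2.0 * math.cos((2 * n + 1) * math.pi / (2 * N)))
        x[n] = g[n] + t
        x[N - 1 - n] = g[n] - t
    return x

Xh = [1.0, -2.0, 0.5, 3.0, -1.5, 2.5, 0.0, 4.0]
x_ref = idct_direct(Xh)
x_fast = idct_fast(Xh)
```

Each recursion level performs N/2 multiplications by the factors 1/(2cos(·)), and there are log2 N levels, which matches the (N/2)·log2 N multiplication count quoted for the fast algorithm.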

Page 22: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

8-point IDCT architecture

Page 23: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Complexity comparison

Page 24: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Multiplier-less DCT architecture

• Using distributed arithmetic

• More area-efficient realization of hardware

• Replacement of multipliers by memory look-up tables

• Regularity of the highly concurrent structure allows modular design of the circuit

• Bit-serial and bit-parallel structures: saving area and easing routing

Page 25: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Distributed Arithmetic (B. Liu -74)

• The most-often encountered form of computation in DSP:

– Sum of product

– Dot-product

– Inner-product

• Distributed arithmetic (DA) is used to design bit-level architectures for vector-vector multiplications (inner products)

– Each word in the vectors is represented as a binary number

– The multiplications are re-ordered and mixed such that the arithmetic becomes “distributed” through the structure

Page 26: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Technical Overview of DA

• Advantage of DA: efficiency of the computing mechanization

• A frequently argued disadvantage:

– Slowness, because of its inherently bit-serial nature

– Some modifications increase the speed, at the expense of:

– more arithmetic operations

– exponentially increased memory

Page 27: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

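Conventional distributed arithmetic

• An inner product between two length-N vectors C and X:

$$Y = \sum_{i=0}^{N-1} c_i x_i$$

• where the $\{c_i\}$ are M-bit constants and the $\{x_i\}$ are coded as W-bit 2's complement numbers as follows:

$$x_i = -x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}$$

– Now substituting the above equation, we have

$$Y = \sum_{i=0}^{N-1} c_i\left(-x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}\right) = -\sum_{i=0}^{N-1} c_i x_{i,W-1} + \sum_{j=1}^{W-1}\left(\sum_{i=0}^{N-1} c_i x_{i,W-1-j}\right)2^{-j}$$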

Page 28: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Conventional distributed arithmetic

• Define

$$C_{W-1-j} = \sum_{i=0}^{N-1} c_i x_{i,W-1-j} \quad (j \ne 0), \qquad C_{W-1} = -\sum_{i=0}^{N-1} c_i x_{i,W-1}$$

• Then

$$Y = \sum_{j=0}^{W-1} C_{W-1-j}\,2^{-j}$$

• By interchanging the summing order of i and j, the initial multiplications are now distributed to another computation pattern.

• Since the term $C_j$ depends on the $x_{i,j}$ values and has only $2^N$ possible values, it is possible to pre-compute them and store them in a ROM.

• An input set of N bits $(x_{0,j}, x_{1,j}, \ldots, x_{N-1,j})$ is used as an address to get the $C_j$ values.

• These intermediate results are accumulated in W clock cycles to produce one Y value.
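The bit-serial scheme can be simulated in software. The sketch below assumes integer coefficients and W-bit two's complement integer inputs, which is the fractional format of the slides scaled by 2^(W−1); `build_rom` and `da_inner_product` are illustrative names, and each loop iteration stands for one clock cycle of the shift-accumulator.

```python
def build_rom(c):
    """ROM with 2^N words: the word at address a is the sum of c[i] over the set bits i of a."""
    n = len(c)
    return [sum(c[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]

def da_inner_product(c, x, W):
    """Inner product sum(c[i] * x[i]) with no multiplier: one ROM read per bit plane."""
    rom = build_rom(c)
    acc = 0
    for b in range(W):                        # LSB first, one bit plane per cycle
        addr = 0
        for i, xi in enumerate(x):
            addr |= ((xi >> b) & 1) << i      # bit b of every x_i forms the ROM address
        word = rom[addr]
        if b == W - 1:
            acc -= word << b                  # the sign-bit plane is subtracted
        else:
            acc += word << b                  # other planes are added with weight 2^b
    return acc

c = [3, -5, 7, 2]
x = [-8, 7, -1, 4]                            # 4-bit two's complement values
y = da_inner_product(c, x, W=4)
```

For these values the direct inner product is 3·(−8) + (−5)·7 + 7·(−1) + 2·4 = −58, which the accumulator reproduces after W = 4 cycles using only ROM reads, shifts, and additions.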

Page 29: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Example Content of ROM (N=4)

Page 30: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Architecture of computing inner product of two length-N vectors using DA

The result is obtained after W clock cycles. This is called bit-serial distributed arithmetic. Speed is limited because each output takes W cycles.

Page 31: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Speeding up bit-serial DA

• Use digit-serial distributed arithmetic, where a digit containing multiple bits is processed in a clock cycle

• E.g. if J consecutive bits are processed in a single clock cycle using J ROMs, then the input words are processed in W/J clock cycles.

• A multi-input shift-accumulator adds the contents of J ROMs and the previous accumulated results

Page 32: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

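DA with Offset-Binary Coding

• Offset-binary coding (OBC) can be used to reduce the ROM size by a factor of 2.

$$x_i = \frac{1}{2}\left[x_i - (-x_i)\right]$$

where

$$-x_i = -\bar{x}_{i,W-1} + \sum_{j=1}^{W-1} \bar{x}_{i,W-1-j}\,2^{-j} + 2^{-(W-1)}$$

so that

$$x_i = \frac{1}{2}\left[-(x_{i,W-1} - \bar{x}_{i,W-1}) + \sum_{j=1}^{W-1}(x_{i,W-1-j} - \bar{x}_{i,W-1-j})\,2^{-j} - 2^{-(W-1)}\right] \qquad \text{(eqn. 1)}$$

Define

$$d_{i,j} = \begin{cases} x_{i,j} - \bar{x}_{i,j}, & \text{for } j \ne W-1 \\ -(x_{i,W-1} - \bar{x}_{i,W-1}), & \text{for } j = W-1 \end{cases} \qquad d_{i,j} \in \{-1, 1\}$$

Eqn. 1 can be rewritten as

$$x_i = \frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] \qquad \text{(eqn. 2)}$$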

Page 33: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

DA with Offset-Binary Coding

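Using eqn. 2, the original Y can be written as

$$Y = \sum_{i=0}^{N-1} c_i\,\frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] = \sum_{j=0}^{W-1}\left(\sum_{i=0}^{N-1}\frac{1}{2}c_i d_{i,W-1-j}\right)2^{-j} - \left(\frac{1}{2}\sum_{i=0}^{N-1} c_i\right)2^{-(W-1)}$$

Now define

$$D_j = \sum_{i=0}^{N-1}\frac{1}{2}c_i d_{i,j} \quad \text{for } 0 \le j \le W-1, \qquad \text{and} \qquad D_{extra} = -\frac{1}{2}\sum_{i=0}^{N-1} c_i$$

We have

$$Y = \sum_{j=0}^{W-1} D_{W-1-j}\,2^{-j} + D_{extra}\,2^{-(W-1)}$$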

Page 34: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Content of the ROM with OBC Coding (N=4)

• Table 13.3: the $D_j$ values are mirrored (equal in magnitude, opposite in sign), so $D_j$ has only $2^{N-1}$ possible values depending on the $x_{i,j}$ values, and the ROM size is reduced by a factor of 2.
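The half-size ROM can be simulated as follows. This sketch makes several assumptions for convenience: coefficients and inputs are integers (the fractional format scaled by 2^(W−1)), the ROM stores 2·D_j so that all values stay integral, and bit x_0 selects between a ROM word and the negated word at the complemented address. `build_obc_rom` and `obc_inner_product` are illustrative names.

```python
def build_obc_rom(c):
    """Half-size ROM (2^(N-1) words). Word at address a, formed from the bits of
    x_1..x_{N-1} with x_0 = 0, holds 2*D = sum_i c[i]*d_i with d_i = +1 for bit 1, -1 for bit 0."""
    n = len(c)
    rom = []
    for a in range(1 << (n - 1)):
        s = -c[0]                                  # x_0 = 0 gives d_0 = -1
        for i in range(1, n):
            s += c[i] if (a >> (i - 1)) & 1 else -c[i]
        rom.append(s)
    return rom

def obc_inner_product(c, x, W):
    n = len(c)
    rom = build_obc_rom(c)
    mask = (1 << (n - 1)) - 1
    acc = 0
    for b in range(W):                             # one bit plane per cycle, LSB first
        bits = [(xi >> b) & 1 for xi in x]
        a = 0
        for i in range(1, n):
            a |= bits[i] << (i - 1)
        # Mirror property: if x_0 = 1, use the complemented address and negate
        two_d = -rom[~a & mask] if bits[0] else rom[a]
        if b == W - 1:
            two_d = -two_d                         # d is negated for the sign bit
        acc += two_d << b
    acc -= sum(c)                                  # adds 2*D_extra = -sum(c)
    return acc // 2                                # undo the factor 2 kept in the ROM

c = [3, -5, 7, 2]
x = [-8, 7, -1, 4]                                 # 4-bit two's complement values
y = obc_inner_product(c, x, W=4)
```

With N = 4 the table has 8 words instead of 16; the price is the address-complement and sign-control logic plus the extra D_extra term in the accumulator.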

Page 35: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Architecture with OBC coding

Page 36: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

ROM decomposition for DA

• ROM size increases exponentially with N

– ROM access time can be a bottleneck, especially when N is large

– Reducing the size of the ROM is important

• Solution

– Divide the N address bits into N/K groups of K bits

– Decompose the ROM of size $2^N$ into N/K ROMs of size $2^K$

– Add the outputs of these ROMs using a multi-input accumulator

– The reduction of the storage size is balanced by a linear increase in the computational complexity of the accumulator

– Carry-save arithmetic can be used to realize the multi-input accumulator to minimize the computation time
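The effect of the decomposition on storage can be illustrated with a small sketch. It assumes the conventional (non-OBC) ROM contents, i.e. each word is the sum of the coefficients selected by the address bits; `build_rom`, `build_split_roms`, and `split_lookup` are illustrative names.

```python
def build_rom(c):
    """Full ROM: the word at address a is the sum of c[i] over the set bits i of a."""
    n = len(c)
    return [sum(c[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]

def build_split_roms(c, K):
    """One small ROM per group of K coefficients."""
    return [build_rom(c[g:g + K]) for g in range(0, len(c), K)]

def split_lookup(roms, addr, K):
    """Each K-bit address slice reads its own ROM; the outputs are added
    (the job of the multi-input accumulator in hardware)."""
    mask = (1 << K) - 1
    return sum(rom[(addr >> (K * g)) & mask] for g, rom in enumerate(roms))

c = [1, 2, 3, 4, 5, 6, 7, 8]          # N = 8 coefficients
full = build_rom(c)                   # 2^8 = 256 words
roms = build_split_roms(c, K=4)       # 2 ROMs of 2^4 = 16 words each
total_words = sum(len(r) for r in roms)
```

For N = 8 and K = 4 the storage drops from 256 words to 32, while each lookup now needs one extra addition to combine the two partial results.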

Page 37: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Multi-input accumulator

CPA: carry-propagate adder

CSA: carry-save adder

[Figure: three multi-input accumulator structures with delays N·T_fa, 4·T_fa, and 3·T_fa; the last is annotated "More register".]

Page 38: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Architecture with ROM decomposition

Page 39: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

Conclusion on DA

• DA is a very efficient mechanism for computations that are dominated by inner products (e.g. convolution).

• It is a good way to trade combinational logic for memory in high-performance computation.

• When many computing methods are compared, DA should be considered. It is not always (but often) the best, and never poor: it saves around 50% to 80% of the gate count.

• Application: “VLSI implementation of a 16×16 discrete cosine transform,” by M.-T. Sun, T.-C. Chen, A. M. Gottlieb, IEEE Transactions on Circuits and Systems, vol. 36, no. 4, April 1989, pp. 610–617, and many other transforms and DSP kernels.

Page 40: ELEC692 VLSI Signal Processing Architecture Lecture 9 VLSI Architecture for Discrete Cosine Transform

DCT architecture using DA

For small DCT sizes, we can use combinational logic (CB) to implement the ROM, which reduces the critical path delay.