ELEC692 VLSI Signal Processing Architecture
Lecture 9: VLSI Architecture for Discrete Cosine Transform
Discrete Cosine Transform
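• Frequency transform
• Used for pattern recognition, image processing, and still- and moving-image (video) processing
• For an N-point sequence x(n), the N-point DCT and IDCT pair is defined as

$$X(k) = e(k)\sum_{n=0}^{N-1} x(n)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1$$

$$x(n) = \frac{2}{N}\sum_{k=0}^{N-1} e(k)X(k)\cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad n = 0, 1, \ldots, N-1$$

where

$$e(k) = \begin{cases} 1/\sqrt{2}, & \text{if } k = 0 \\ 1, & \text{otherwise} \end{cases}$$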
N-point DCT/IDCT
• The N-point DCT and IDCT pair can be derived from a 2N-point discrete Fourier transform (DFT) pair, using x(n) and its mirror image:

$$y(n) = x(n) + x(2N-n-1) = \begin{cases} x(n), & 0 \le n \le N-1 \\ x(2N-n-1), & N \le n \le 2N-1 \end{cases}$$

• y(n) is symmetric about the midpoint n = N - 1/2. The 2N-point DFT of y(n) is given by (for 0 <= k <= 2N-1)

$$Y_D(k) = \sum_{n=0}^{2N-1} y(n)\,e^{-j\frac{2\pi}{2N}kn} = \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{2N}kn} + \sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi}{2N}kn}$$
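Substituting n = 2N - n' - 1 into the second summation, we have

$$\sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\frac{2\pi}{2N}kn} = \sum_{n'=N-1}^{0} x(n')\,e^{-j\frac{2\pi}{2N}k(2N-n'-1)} = \sum_{n'=0}^{N-1} x(n')\,e^{j\frac{2\pi}{2N}kn'}\,e^{j\frac{2\pi}{2N}k}$$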
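N-point DCT (cont.)
• Now we have

$$\begin{aligned}
Y_D(k) &= \sum_{n=0}^{N-1} x(n)\,e^{-j\frac{2\pi}{2N}kn} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{2\pi}{2N}kn}\,e^{j\frac{2\pi}{2N}k} \\
&= e^{j\frac{\pi k}{2N}}\left(\sum_{n=0}^{N-1} x(n)\,e^{-j\frac{(2n+1)k\pi}{2N}} + \sum_{n=0}^{N-1} x(n)\,e^{j\frac{(2n+1)k\pi}{2N}}\right) \\
&= e^{j\frac{\pi k}{2N}}\sum_{n=0}^{N-1} 2\,x(n)\cos\left(\frac{(2n+1)k\pi}{2N}\right)
\end{aligned}$$

• Define

$$\hat{X}(k) = \begin{cases} Y_D(k)\,e^{-j\frac{\pi k}{2N}}, & 0 \le k \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

• The N-point DCT can then be expressed as $X(k) = e(k)\hat{X}(k)/2$.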
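As a numerical check of this derivation, the following minimal sketch computes the DCT both from its definition and from the 2N-point DFT of the mirrored sequence. The function names `dct_direct` and `dct_via_dft` are illustrative, and NumPy's FFT is assumed to stand in for the 2N-point DFT (it uses the same exp(-j2πkn/2N) kernel).

```python
import numpy as np

def dct_direct(x):
    """N-point DCT from the definition: X(k) = e(k) * sum_n x(n) cos((2n+1)k*pi/2N)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = np.arange(N)
    basis = np.cos(np.outer(k, 2 * n + 1) * np.pi / (2 * N))  # basis[k, n]
    e = np.where(k == 0, 1 / np.sqrt(2), 1.0)
    return e * (basis @ x)

def dct_via_dft(x):
    """Same DCT obtained from the 2N-point DFT of y(n) = x(n) followed by its mirror image."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    y = np.concatenate([x, x[::-1]])        # y(n) = x(n) for n < N, x(2N-n-1) otherwise
    Y = np.fft.fft(y)                       # Y_D(k), k = 0..2N-1
    k = np.arange(N)
    # X_hat(k) = Y_D(k) * exp(-j*pi*k/2N); the result is real up to rounding error
    X_hat = (Y[:N] * np.exp(-1j * np.pi * k / (2 * N))).real
    e = np.where(k == 0, 1 / np.sqrt(2), 1.0)
    return e * X_hat / 2                    # X(k) = e(k) * X_hat(k) / 2

x = [1.0, 2.0, -3.0, 0.5, 4.0, -1.5, 2.5, 0.0]
print(np.allclose(dct_direct(x), dct_via_dft(x)))
```

The two routes should agree to within floating-point rounding for any real input.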
N-point DCT/IDCT
• An N-point 1-D DCT requires $N^2$ multiplications and additions.
• For image compression, an N×N block needs an N×N 2-D DCT:

$$Y_{k_1,k_2} = \frac{2}{N}\,c(k_1)\,c(k_2)\sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1} x_{n_1,n_2}\cos\left[\frac{2\pi(2n_1+1)k_1}{4N}\right]\cos\left[\frac{2\pi(2n_2+1)k_2}{4N}\right]$$

where c(k) is the same normalization factor as e(k) ($1/\sqrt{2}$ for k = 0, 1 otherwise).
• Direct computation of the N×N 2-D DCT requires $N^4$ multiplications and additions.
• Using the separability of the 2-D DCT, it can be computed by performing N 1-D DCTs on the rows of the image block, followed by N 1-D DCTs on the resulting columns.
• Complexity is reduced to $2N^3$ multiply-add operations, or $4N^3$ arithmetic operations.
2D DCT
• The 2-D Discrete Cosine Transform is separable, i.e., it can be expressed as two consecutive 1-D transforms:

$$\text{1-D DCT:}\quad X = A\,x$$
$$\text{2-D DCT:}\quad X = A\,x\,A^{T}$$

• Observe that in the 2-D case X and x are 2-D (N×N) data matrices. A 2-D transform can therefore be calculated by using a 1-D transform hardware unit twice, with a matrix transposition of the intermediate result in between.
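The row-column scheme can be sketched in a few lines. This is a hedged illustration, not the core processor from the slides: the helper names are hypothetical, and the matrix A is assumed to carry the orthonormal scaling sqrt(2/N)·e(k), chosen so that A x Aᵀ reproduces the 2/N factor of the 2-D definition.

```python
import numpy as np

def dct_matrix(N):
    """Assumed scaling: A[k, n] = sqrt(2/N) * e(k) * cos((2n+1) k pi / 2N)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    A = np.sqrt(2.0 / N) * np.cos((2 * n + 1) * k * np.pi / (2 * N))
    A[0, :] /= np.sqrt(2.0)                 # e(0) = 1/sqrt(2)
    return A

def dct1d_on_rows(block, A):
    """Model of the 1-D hardware unit: applies the 1-D DCT to every row of block."""
    return block @ A.T

def dct2d_row_column(x):
    """2-D DCT using the 1-D unit twice with a transposition in between."""
    A = dct_matrix(x.shape[0])
    stage1 = dct1d_on_rows(x, A)            # 1-D DCT of each row
    stage2 = dct1d_on_rows(stage1.T, A)     # transpose, then 1-D DCT again
    return stage2.T                         # equals A @ x @ A.T

def dct2d_direct(x):
    """Direct O(N^4) evaluation of the double sum, for comparison only."""
    N = x.shape[0]
    A = dct_matrix(N)
    Y = np.zeros((N, N))
    for k1 in range(N):
        for k2 in range(N):
            Y[k1, k2] = sum(A[k1, n1] * A[k2, n2] * x[n1, n2]
                            for n1 in range(N) for n2 in range(N))
    return Y

x = np.arange(16, dtype=float).reshape(4, 4)
print(np.allclose(dct2d_row_column(x), dct2d_direct(x)))
```

The row-column version performs 2N one-dimensional transforms of N² multiply-adds each, which is the 2N³ count quoted above.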
Block diagram and timing diagram of DCT core processor
Algorithm-Architecture Transformation of DCT
• A hierarchical way to adapt an architecture to a given algorithm or change the algorithm’s description in a systematic way.
• The number of multiplications in the DCT can be reduced using this technique, e.g. for the 8-point DCT

$$y(k) = a_k\sum_{n=0}^{7} x(n)\cos\left(\frac{(2n+1)k\pi}{16}\right), \quad k = 0, 1, \ldots, 7$$

Combining $a_k$ and the cosine expression into one coefficient $b_{n,k}$, we have the following dataflow graph
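Algorithm-Architecture Transformation of DCT
We can write the dataflow graph in matrix form:

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} =
\begin{bmatrix}
c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4\\
c_1 & c_3 & c_5 & c_7 & c_9 & c_{11} & c_{13} & c_{15}\\
c_2 & c_6 & c_{10} & c_{14} & c_{18} & c_{22} & c_{26} & c_{30}\\
c_3 & c_9 & c_{15} & c_{21} & c_{27} & c_1 & c_7 & c_{13}\\
c_4 & c_{12} & c_{20} & c_{28} & c_4 & c_{12} & c_{20} & c_{28}\\
c_5 & c_{15} & c_{25} & c_3 & c_{13} & c_{23} & c_1 & c_{11}\\
c_6 & c_{18} & c_{30} & c_{10} & c_{22} & c_2 & c_{14} & c_{26}\\
c_7 & c_{21} & c_3 & c_{17} & c_{31} & c_{13} & c_{27} & c_9
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$

where

$$c_i = \cos\frac{i\pi}{16}$$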
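Transformation in 3 steps
1st step: systematically modify the DCT algorithm, here using trigonometric properties (symmetry and periodicity of the cosine) so that every coefficient becomes one of $\pm c_1, \ldots, \pm c_7$:

$$\begin{bmatrix} y(0)\\ y(1)\\ y(2)\\ y(3)\\ y(4)\\ y(5)\\ y(6)\\ y(7) \end{bmatrix} =
\begin{bmatrix}
c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4 & c_4\\
c_1 & c_3 & c_5 & c_7 & -c_7 & -c_5 & -c_3 & -c_1\\
c_2 & c_6 & -c_6 & -c_2 & -c_2 & -c_6 & c_6 & c_2\\
c_3 & -c_7 & -c_1 & -c_5 & c_5 & c_1 & c_7 & -c_3\\
c_4 & -c_4 & -c_4 & c_4 & c_4 & -c_4 & -c_4 & c_4\\
c_5 & -c_1 & c_7 & c_3 & -c_3 & -c_7 & c_1 & -c_5\\
c_6 & -c_2 & c_2 & -c_6 & -c_6 & c_2 & -c_2 & c_6\\
c_7 & -c_5 & c_3 & -c_1 & c_1 & -c_3 & c_5 & -c_7
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}$$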
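Algorithm-Architecture Transformation of DCT
• Then the 8-point DCT can be rewritten as

$$\begin{aligned}
y(1) &= M_0c_1 + M_1c_7 + M_2c_3 + M_3c_5\\
y(7) &= M_0c_7 - M_1c_1 - M_2c_5 + M_3c_3\\
y(3) &= M_0c_3 - M_1c_5 - M_2c_7 - M_3c_1\\
y(5) &= M_0c_5 + M_1c_3 - M_2c_1 + M_3c_7\\
y(2) &= M_{10}c_2 + M_{11}c_6\\
y(6) &= M_{10}c_6 - M_{11}c_2\\
y(4) &= M_{100}c_4\\
y(0) &= P_{100}c_4
\end{aligned}$$

where

$$\begin{aligned}
M_0 &= x_0 - x_7, & P_0 &= x_0 + x_7\\
M_1 &= x_3 - x_4, & P_1 &= x_3 + x_4\\
M_2 &= x_1 - x_6, & P_2 &= x_1 + x_6\\
M_3 &= x_2 - x_5, & P_3 &= x_2 + x_5\\
M_{10} &= P_0 - P_1, & P_{10} &= P_0 + P_1\\
M_{11} &= P_2 - P_3, & P_{11} &= P_2 + P_3\\
M_{100} &= P_{10} - P_{11}, & P_{100} &= P_{10} + P_{11}
\end{aligned}$$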
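The rewritten form can be checked against the original 8-point sum with a short sketch. The names `dct8_direct` and `dct8_decomposed` are hypothetical, the minus signs are those obtained by reducing each cos((2n+1)kπ/16) to ±c1..±c7, and a_k is assumed to be folded into the coefficients (c4 for k = 0, 1 otherwise).

```python
import math

C = [math.cos(i * math.pi / 16) for i in range(8)]   # C[i] = c_i = cos(i*pi/16)

def dct8_direct(x):
    """Reference: y(k) = sum_n b(n,k) x(n), with b = c4 for k = 0, else cos((2n+1)k*pi/16)."""
    y = [C[4] * sum(x)]
    for k in range(1, 8):
        y.append(sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)))
    return y

def dct8_decomposed(x):
    """Same outputs from the shared sums (P terms) and differences (M terms)."""
    M0, P0 = x[0] - x[7], x[0] + x[7]
    M1, P1 = x[3] - x[4], x[3] + x[4]
    M2, P2 = x[1] - x[6], x[1] + x[6]
    M3, P3 = x[2] - x[5], x[2] + x[5]
    M10, P10 = P0 - P1, P0 + P1
    M11, P11 = P2 - P3, P2 + P3
    M100, P100 = P10 - P11, P10 + P11
    y = [0.0] * 8
    y[1] = M0 * C[1] + M1 * C[7] + M2 * C[3] + M3 * C[5]
    y[7] = M0 * C[7] - M1 * C[1] - M2 * C[5] + M3 * C[3]
    y[3] = M0 * C[3] - M1 * C[5] - M2 * C[7] - M3 * C[1]
    y[5] = M0 * C[5] + M1 * C[3] - M2 * C[1] + M3 * C[7]
    y[2] = M10 * C[2] + M11 * C[6]
    y[6] = M10 * C[6] - M11 * C[2]
    y[4] = M100 * C[4]
    y[0] = P100 * C[4]
    return y

x = [3.0, -1.0, 4.0, 1.5, -5.0, 9.0, 2.0, -6.5]
print(all(math.isclose(a, b, abs_tol=1e-12)
          for a, b in zip(dct8_direct(x), dct8_decomposed(x))))
```

Written this way the transform uses 22 multiplications instead of the 64 of the full matrix product, before any further block-level simplification.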
Algorithm-Architecture Transformation of DCT
• Step 2 transformation: DCT structure is grouped into different functional units represented by blocks and then the whole DCT structure is transformed into a block diagram.
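• Two major blocks:
– Butterfly (add/subtract) block: inputs x(0) and x(1); outputs x(0)+x(1) and x(0)-x(1)
– Multiply-add block with coefficients a and b: inputs x(0) and x(1); outputs ax(0)+bx(1) and bx(0)-ax(1)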
Algorithm-Architecture Transformation of DCT
• The transformed block diagram: [figure: the 8-point DCT redrawn with the two block types]
Algorithm-Architecture Transformation of DCT
• Step 3- reduce complexity of the implementations of the blocks.
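• The multiply-add block (coefficients a and b) can be realized with 3 multiplications and 3 additions instead of 4 multiplications and 2 additions.
• Define the block with a = sin θ and b = cos θ, and with reversed outputs, as a rotator block that computes

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$$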
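One way to reach 3 multiplications and 3 additions is to share the product cos θ·(x + y). The slides do not say which factorization they use, so the sketch below is only one common choice, with hypothetical function names; the constants cos θ + sin θ and sin θ - cos θ are assumed to be precomputed, as they would be in fixed hardware.

```python
import math

def rotate_4mul(x, y, theta):
    """Straightforward rotator: 4 multiplications, 2 additions."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_3mul(x, y, theta):
    """Same rotation with 3 multiplications and 3 additions."""
    c, s = math.cos(theta), math.sin(theta)
    k_sum = c + s            # constant, precomputed offline
    k_diff = s - c           # constant, precomputed offline
    t = c * (x + y)          # addition 1, multiplication 1
    x_rot = t - k_sum * y    # multiplication 2, addition 2: c*x - s*y
    y_rot = t + k_diff * x   # multiplication 3, addition 3: s*x + c*y
    return x_rot, y_rot

a = rotate_4mul(1.25, -0.5, 3 * math.pi / 16)
b = rotate_3mul(1.25, -0.5, 3 * math.pi / 16)
print(math.isclose(a[0], b[0], abs_tol=1e-12) and math.isclose(a[1], b[1], abs_tol=1e-12))
```

The trade is one multiplier for one adder, which is usually favourable in area since a multiplier is much larger than an adder.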
Other transformations
Algorithm-Architecture Transformation of DCT
• Final architecture: 13 multiplications, 31 additions
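Decimation-in-Frequency Fast DCT for $2^m$-Point IDCT
• Decimation in frequency (DIF) is commonly used for the DFT.
• It reduces the number of multiplications to about $(N/2)\log_2 N$ by power-of-2 decomposition.
• For simplicity the 2/N scaling factor is ignored. With $\hat{X}(k) = e(k)X(k)$, the IDCT becomes $x(n) = \sum_{k=0}^{N-1}\hat{X}(k)\cos\left(\frac{(2n+1)k\pi}{2N}\right)$.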
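Fast DCT/IDCT (FCT)
– Decomposing into even and odd indexes of k:

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)(2k)\pi}{2N}\right)}_{g(n)} + \underbrace{\sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right)}_{h(n)}$$

$$g(n) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) = \sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad \ldots\ N/2\text{-point IDCT}$$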
For h'(n) we use

$$2\cos\left(\frac{(2n+1)\pi}{2N}\right)\cos\left(\frac{(2n+1)(2k+1)\pi}{2N}\right) = \cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right) + \cos\left(\frac{(2n+1)2k\pi}{2N}\right)$$
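We have

$$\begin{aligned}
2\cos\left(\frac{(2n+1)\pi}{2N}\right)h(n) \equiv h'(n)
&= \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2(k+1)\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right)\\
&= \sum_{k=0}^{N/2-1} e(2k-1)X(2k-1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right) + \sum_{k=0}^{N/2-1} e(2k+1)X(2k+1)\cos\left(\frac{(2n+1)2k\pi}{2N}\right)\\
&= \sum_{k=0}^{N/2-1} \left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)2k\pi}{2N}\right)\\
&= \sum_{k=0}^{N/2-1} \left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{2(N/2)}\right) \quad \ldots\ N/2\text{-point IDCT}
\end{aligned}$$

The first line uses $2\cos A\cos B = \cos(A+B) + \cos(A-B)$. The second line shifts the index of the first sum by one, with the convention $e(2k-1)X(2k-1)\big|_{k=0} = 0$; the term at k = N/2 drops out because $\cos\left((2n+1)\pi/2\right) = 0$.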
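N-point IDCT can be decomposed using N/2-point IDCT

$$x(n) = \underbrace{\sum_{k=0}^{N/2-1} e(2k)X(2k)\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT},\ g(n)} + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}\underbrace{\sum_{k=0}^{N/2-1}\left[e(2k+1)X(2k+1) + e(2k-1)X(2k-1)\right]\cos\left(\frac{(2n+1)k\pi}{N}\right)}_{N/2\text{-point IDCT},\ h'(n)}$$

$$x(n) = g(n) + \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}h'(n)$$

$$x(N-1-n) = g(n) - \frac{1}{2\cos\left(\frac{(2n+1)\pi}{2N}\right)}h'(n)$$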
N-point IDCT Architecture
[Figure: N-point IDCT. The inputs X(0), ..., X(N-1) pass through an even-odd index mapping. The even-indexed inputs X(0), X(2), ..., X(N-2) feed an N/2-point IDCT producing g(k); an odd-summation stage adds neighbouring odd-indexed inputs X(1), X(3), ..., X(N-1) and feeds a second N/2-point IDCT producing h'(k). The h'(k) outputs are scaled by $C_k = 1/\left(2\cos\frac{(2k+1)\pi}{2N}\right)$, k: 0 ~ N/2-1, combined with g(k) in add/subtract stages, and re-ordered to give x(0), ..., x(N-1).]
• Since

$$\cos\left(\frac{(2(N-1-n)+1)k\pi}{N}\right) = \cos\left(\frac{(2n+1)k\pi}{N}\right), \qquad \cos\left(\frac{(2(N-1-n)+1)\pi}{2N}\right) = -\cos\left(\frac{(2n+1)\pi}{2N}\right)$$

the N-point IDCT can be expressed in terms of two N/2-point IDCTs. By repeating this process, the IDCT can be decomposed further until it is expressed in terms of 2-point IDCTs (the DCT can be decomposed in a similar fashion).
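The recursion can be sketched directly from the g(n) and h'(n) relations. This is a minimal illustration with hypothetical function names; it takes X̂(k) = e(k)X(k) as input, ignores the 2/N scaling as the slides do, and assumes N is a power of two.

```python
import math

def idct_direct(X_hat):
    """Reference: x(n) = sum_k X_hat(k) cos((2n+1) k pi / 2N)."""
    N = len(X_hat)
    return [sum(X_hat[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k in range(N)) for n in range(N)]

def idct_fast(X_hat):
    """Decimation-in-frequency IDCT built from two half-size IDCTs (N a power of 2)."""
    N = len(X_hat)
    if N == 1:
        return [X_hat[0]]                          # cos(0) = 1
    G = [X_hat[2 * k] for k in range(N // 2)]      # even-indexed inputs
    # odd summation X_hat(2k+1) + X_hat(2k-1), with the k = 0 term X_hat(-1) taken as 0
    H = [X_hat[2 * k + 1] + (X_hat[2 * k - 1] if k > 0 else 0.0)
         for k in range(N // 2)]
    g = idct_fast(G)
    h_prime = idct_fast(H)
    x = [0.0] * N
    for n in range(N // 2):
        scale = 1.0 / (2.0 * math.cos((2 * n + 1) * math.pi / (2 * N)))
        x[n] = g[n] + scale * h_prime[n]
        x[N - 1 - n] = g[n] - scale * h_prime[n]
    return x

X_hat = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 4.0, -0.75]
print(all(math.isclose(a, b, abs_tol=1e-9)
          for a, b in zip(idct_direct(X_hat), idct_fast(X_hat))))
```

Each level spends N/2 multiplications on the scale factors, which gives the roughly (N/2)log2 N total mentioned earlier.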
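2-point IDCT butterfly architecture

$$x(0) = \hat{X}(0) + \hat{X}(1)\cos\frac{\pi}{4}$$
$$x(1) = \hat{X}(0) - \hat{X}(1)\cos\frac{\pi}{4}$$

[Figure: butterfly with inputs $\hat{X}(0)$ and $\hat{X}(1)$, a multiplier cos(π/4) on the $\hat{X}(1)$ branch, a -1 on the lower branch, and outputs x(0) and x(1).]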
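E.g. 8-point IDCT

$$G(n) = \hat{X}(2n), \qquad H(n) = \hat{X}(2n+1) + \hat{X}(2n-1), \qquad n = 0, 1, 2, 3$$

$$g(k) = \sum_{n=0}^{3} G(n)\cos\left(\frac{(2k+1)n\pi}{2(N/2)}\right), \qquad h(k) = \sum_{n=0}^{3} H(n)\cos\left(\frac{(2k+1)n\pi}{2(N/2)}\right)$$

$$x(k) = g(k) + \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h(k), \qquad x(7-k) = g(k) - \frac{1}{2\cos\frac{(2k+1)\pi}{16}}\,h(k), \qquad k = 0, 1, 2, 3$$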
8-point IDCT architecture
Complexity comparison
Multiplier-less DCT architecture
• Using distributed arithmetic
• More area-efficient realization of hardware
• Replacement of multipliers by memory look-up tables
• Regularity of the highly concurrent structure allows modular design of the circuit
• Bit-serial and bit-parallel structure: saves area and eases routing
Distributed Arithmetic (B. Liu -74)
• The most-often encountered form of computation in DSP:
– Sum of product
– Dot-product
– Inner-product
• Distributed arithmetic (DA) is used to design bit-level architectures for vector-vector multiplications (inner products)
– Each word in the vectors is represented as a binary number
– The multiplications are re-ordered and mixed such that the arithmetic becomes “distributed” through the structure
Technical Overview of DA
• Advantage of DA: efficiency of mechanization of the computation
• A frequently argued disadvantage:
– Slowness, because of its inherent bit-serial nature
– Some modifications increase the speed, but they cost either more arithmetic operations or exponentially more memory
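Conventional distributed arithmetic
• Consider an inner product between two length-N vectors C and X:

$$Y = \sum_{i=0}^{N-1} c_i x_i$$

• where the $\{c_i\}$ are M-bit constants and the $\{x_i\}$ are coded as W-bit 2's complement numbers as follows:

$$x_i = -x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}$$

– Now substituting the above equation, we have

$$Y = \sum_{i=0}^{N-1} c_i\left(-x_{i,W-1} + \sum_{j=1}^{W-1} x_{i,W-1-j}\,2^{-j}\right) = -\sum_{i=0}^{N-1} c_i x_{i,W-1} + \sum_{j=1}^{W-1}\left(\sum_{i=0}^{N-1} c_i x_{i,W-1-j}\right)2^{-j}$$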
Conventional distributed arithmetic
• Define

$$C_{W-1-j} = \sum_{i=0}^{N-1} c_i x_{i,W-1-j}\ \ (j \ne 0), \qquad C_{W-1} = -\sum_{i=0}^{N-1} c_i x_{i,W-1}$$

• Then

$$Y = \sum_{j=0}^{W-1} C_{W-1-j}\,2^{-j}$$

• By interchanging the summing order of i and j, the initial multiplications are now distributed to another computation pattern.
• Since the term $C_j$ depends only on the $x_{i,j}$ values and has only $2^N$ possible values, it is possible to pre-compute them and store them in a ROM.
• An input set of N bits $(x_{0,j}, x_{1,j}, \ldots, x_{N-1,j})$ is used as an address to retrieve the $C_j$ value.
• These intermediate results are accumulated in W clock cycles to produce one Y value.
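A small software model of the bit-serial scheme may help. This is a sketch under stated assumptions, not the architecture on the slides: `build_rom` and `da_inner_product` are hypothetical names, each x_i is passed as a W-bit two's complement integer q_i with x_i = q_i / 2^(W-1), and the bits are processed LSB first with a halving accumulator.

```python
import math

def build_rom(c):
    """ROM with 2^N words: word at address a holds the sum of c[i] over set bits i of a."""
    N = len(c)
    return [sum(c[i] for i in range(N) if (a >> i) & 1) for a in range(1 << N)]

def da_inner_product(c, q, W):
    """Bit-serial DA: Y = sum_i c[i] * x_i, with x_i = q[i] / 2^(W-1)."""
    N = len(c)
    rom = build_rom(c)
    mask = (1 << W) - 1
    acc = 0.0
    for b in range(W):                       # one bit-plane per clock cycle, LSB first
        # address = bit b of every input word (two's complement bit pattern)
        addr = sum((((q[i] & mask) >> b) & 1) << i for i in range(N))
        term = rom[addr]
        if b == W - 1:                       # sign-bit plane is subtracted
            term = -term
        acc = acc / 2 + term                 # shift-accumulate
    return acc

c = [0.5, -0.25, 0.75, 0.125]
q = [100, -37, 5, -128]                      # 8-bit two's complement samples
W = 8
expected = sum(ci * qi / 2 ** (W - 1) for ci, qi in zip(c, q))
print(math.isclose(da_inner_product(c, q, W), expected, abs_tol=1e-12))
```

No multiplier appears in the loop: each cycle costs one ROM read, one shift and one add or subtract.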
Example Content of ROM (N=4)
Architecture for computing the inner product of two length-N vectors using DA
The result is obtained after W clock cycles. This is called bit-serial distributed arithmetic. Speed is limited because each result takes W cycles.
Speeding up bit-serial DA
• Use digit-serial distributed arithmetic, where a digit containing multiple bits is processed in a clock cycle
• E.g. if J consecutive bits are processed in a single clock cycle using J ROMs, then the input words are processed in W/J clock cycles.
• A multi-input shift-accumulator adds the contents of J ROMs and the previous accumulated results
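DA with Offset-Binary Coding
• Offset-binary coding (OBC) can be used to reduce the ROM size by a factor of 2.

$$x_i = \frac{1}{2}\left[x_i - (-x_i)\right]$$

Where

$$-x_i = -\bar{x}_{i,W-1} + \sum_{j=1}^{W-1}\bar{x}_{i,W-1-j}\,2^{-j} + 2^{-(W-1)}$$

so that

$$x_i = \frac{1}{2}\left[-(x_{i,W-1} - \bar{x}_{i,W-1}) + \sum_{j=1}^{W-1}(x_{i,W-1-j} - \bar{x}_{i,W-1-j})\,2^{-j} - 2^{-(W-1)}\right] \quad \text{(eqn. 1)}$$

Define

$$d_{i,j} = \begin{cases} x_{i,j} - \bar{x}_{i,j}, & \text{for } j \ne W-1\\ -(x_{i,W-1} - \bar{x}_{i,W-1}), & \text{for } j = W-1 \end{cases} \qquad d_{i,j} \in \{-1, 1\}$$

Eqn. 1 can be rewritten as

$$x_i = \frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] \quad \text{(eqn. 2)}$$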
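DA with Offset-Binary Coding
Using eqn. 2, the original Y can be written as

$$Y = \sum_{i=0}^{N-1} c_i\,\frac{1}{2}\left[\sum_{j=0}^{W-1} d_{i,W-1-j}\,2^{-j} - 2^{-(W-1)}\right] = \sum_{j=0}^{W-1}\left(\sum_{i=0}^{N-1}\frac{1}{2}c_i\,d_{i,W-1-j}\right)2^{-j} - \left(\frac{1}{2}\sum_{i=0}^{N-1} c_i\right)2^{-(W-1)}$$

Now define

$$D_j = \sum_{i=0}^{N-1}\frac{1}{2}c_i\,d_{i,j}\ \ \text{for } 0 \le j \le W-1, \qquad \text{and} \qquad D_{extra} = -\frac{1}{2}\sum_{i=0}^{N-1} c_i$$

We have

$$Y = \sum_{j=0}^{W-1} D_{W-1-j}\,2^{-j} + D_{extra}\,2^{-(W-1)}$$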
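Content of the ROM with OBC Coding (N=4)
• Table 13.3: the $D_j$ values are mirrored (complementing every address bit only changes the sign of $D_j$). $D_j$ therefore has only $2^{N-1}$ distinct magnitudes depending on the $x_{i,j}$ values, and the ROM size is reduced by a factor of 2.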
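The mirror property can be exercised with a half-size ROM in a short model. This is an illustrative sketch with hypothetical names (`obc_rom`, `da_obc`); it assumes bit 0 of the address selects between a direct read and a complemented-address read with negated output, which is one possible way to exploit the symmetry and not necessarily the one drawn on the slides.

```python
import math

def obc_rom(c):
    """Half-size ROM (2^(N-1) words): D for every address whose bit 0 is 0."""
    N = len(c)
    rom = []
    for r in range(1 << (N - 1)):
        a = r << 1                                   # full address with bit 0 cleared
        rom.append(0.5 * sum(c[i] * (1 if (a >> i) & 1 else -1) for i in range(N)))
    return rom

def da_obc(c, q, W):
    """Bit-serial DA with offset-binary coding; x_i = q[i] / 2^(W-1)."""
    N = len(c)
    rom = obc_rom(c)
    full = (1 << N) - 1
    mask = (1 << W) - 1
    d_extra = -0.5 * sum(c)
    acc = 0.0
    for b in range(W):                               # LSB first
        addr = sum((((q[i] & mask) >> b) & 1) << i for i in range(N))
        if addr & 1:                                 # mirrored half: complement and negate
            D = -rom[((~addr) & full) >> 1]
        else:
            D = rom[addr >> 1]
        if b == W - 1:                               # sign plane: d = -(x - x_bar)
            D = -D
        acc = acc / 2 + D
        if b == 0:
            acc += d_extra                           # D_extra has the LSB weight 2^-(W-1)
    return acc

c = [0.5, -0.25, 0.75, 0.125]
q = [100, -37, 5, -128]
W = 8
expected = sum(ci * qi / 2 ** (W - 1) for ci, qi in zip(c, q))
print(len(obc_rom(c)), math.isclose(da_obc(c, q, W), expected, abs_tol=1e-12))
```

The ROM here holds 8 words for N = 4 instead of 16, at the cost of an address complement and a conditional negation.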
Architecture with OBC coding
ROM decomposition for DA
• ROM size increases exponentially with N
– ROM access time can be a bottleneck, especially when N is large
– Reducing the size of the ROM is important
• Solution
– Divide the N address bits into N/K groups of K bits
– Decompose the ROM of size $2^N$ into N/K ROMs of size $2^K$
– Add the outputs of these ROMs using a multi-input accumulator
– The reduction of the storage size is balanced by a linear increase in the computational complexity of the accumulator
– Carry-save arithmetic can be used to realize the multi-input accumulator to minimize the computation time
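The storage saving is easy to see in a small model. The sketch below uses hypothetical helper names and assumes K divides N; it splits one 2^N-word ROM into N/K ROMs of 2^K words and checks that the summed partial look-ups match the full table.

```python
def build_rom(c):
    """Full DA ROM: word at address a is the sum of c[i] over set bits i of a."""
    return [sum(ci for i, ci in enumerate(c) if (a >> i) & 1)
            for a in range(1 << len(c))]

def split_roms(c, K):
    """One small ROM per group of K coefficients (assumes K divides len(c))."""
    return [build_rom(c[g:g + K]) for g in range(0, len(c), K)]

def lookup_split(roms, addr, K):
    """Multi-input accumulation: add the outputs of the N/K small ROMs."""
    total = 0.0
    for g, rom in enumerate(roms):
        sub_addr = (addr >> (g * K)) & ((1 << K) - 1)   # K address bits for group g
        total += rom[sub_addr]
    return total

c = [0.5, -0.25, 0.75, 0.125, 1.0, -1.5, 0.375, 2.0]   # N = 8
K = 4
full = build_rom(c)
roms = split_roms(c, K)
same = all(abs(full[a] - lookup_split(roms, a, K)) < 1e-12 for a in range(1 << len(c)))
print(len(full), sum(len(r) for r in roms), same)
```

For N = 8 and K = 4 the storage drops from 256 words to 32, while each look-up needs one extra addition.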
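Multi-input accumulator (CPA: carry-propagate adder; CSA: carry-save adder)
[Figure: three realizations of the multi-input accumulator, with delays of N·T_fa, 4·T_fa and 3·T_fa; the figure notes that more registers are needed.]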
Architecture with ROM decomposition
Conclusion on DA
• DA is a very efficient mechanism for computations that are dominated by inner products (e.g. convolution).
• It is a good way to trade combinational logic for memory in high-performance computation.
• When many computing methods are compared, DA should be considered. It is often (though not always) the best choice and never a poor one: it saves around 50% to 80% of the gate count.
• Application: M.-T. Sun, T.-C. Chen, and A. M. Gottlieb, “VLSI implementation of a 16×16 discrete cosine transform,” IEEE Transactions on Circuits and Systems, vol. 36, no. 4, April 1989, pp. 610–617; also many other transforms and DSP kernels.
DCT architecture using DA
For small size DCT, we can use combinational logic (CB) to implement the ROM. This will reduce the critical path delay