
High Efficient Distributed Video Coding with Parallelized Design for Cloud Computing

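(Chinese title, translated: Distributed Video Coding with Both High Efficiency and Parallelized Design for Cloud Architectures)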

CMLab, CSIE, NTU

Cheng, Han-Ping 程瀚平 Advisor: Prof. Wu, Ja-Ling 吳家麟教授

2010/6/2

Outline

Introduction
DISPAC video codec
RD performance of DISPAC
Parallelizing DISPAC decoder
Decoding speed of DISPAC
Conclusions and future work

Trends of Cloud Computing

Cloud computing makes clients slimmer and thinner

Video Coding in Cloud Computing

Only a low-complexity encoder and decoder are needed at the client side.

Conventional video coding (e.g. H.264): encode once, decode many times; low-complexity decoder.

Distributed Video Coding (DVC), e.g. video surveillance, wireless sensor networks: low-complexity encoder.

Distributed Video Coding

Slepian-Wolf Theorem (1973)

Wyner-Ziv Theorem (1976)

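[Figure: separate encoders, joint decoder. Source X → Encoder X and Source Y → Encoder Y → Joint Decoder → X, Y. The dependency between X and Y exists but is not exploited by the encoders.]

$$R_X \ge H(X), \quad R_Y \ge H(Y), \quad R_X + R_Y \ge\ ?$$

[Figure: joint encoder, joint decoder (conventional video coding paradigm). The statistical dependency between Source X and Source Y is exploited by the joint encoder.]

$$R_X + R_Y \ge H(X, Y)$$

Slepian & Wolf: with separate encoders and a joint decoder, the bound is also $R_X + R_Y \ge H(X, Y)$!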
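The gap between coding the two sources independently and the joint entropy can be checked numerically. The sketch below uses a hypothetical joint distribution of two correlated binary sources (the probabilities are illustrative assumptions, not taken from the slides) and prints H(X) + H(Y), H(X, Y) and H(X | Y) = H(X, Y) - H(Y).

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf of two correlated binary sources X and Y.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

# Marginal distributions of X and Y.
p_x = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy(joint.values())
h_x_given_y = h_xy - h_y  # chain rule: H(X, Y) = H(Y) + H(X | Y)

print(f"H(X) + H(Y) = {h_x + h_y:.3f} bits")
print(f"H(X, Y)     = {h_xy:.3f} bits")
print(f"H(X | Y)    = {h_x_given_y:.3f} bits")
```

For correlated sources H(X, Y) is smaller than H(X) + H(Y), which is the saving that joint decoding can recover.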

Distributed Video Coding

Wyner-Ziv Theorem (1976): extends the result to lossy coding.

[Figure: Wyner-Ziv codec. Source Y → source encoder → source decoder → side information estimation → Y. Source X → quantizer → channel encoder → parity bits P → channel decoder (using Y) → X'. The correlation between X and Y is treated as a virtual channel.]

Channel coding (error control code) analogy: X → channel encoder → X+P → noisy channel → (X+P)' → channel decoder → X'.

$$R_Y \ge H(Y), \quad R_X \ge H(X \mid Y)$$

Correlation is exploited. $R_X + R_Y \ge\ ?$ Wyner & Ziv: $H(X, Y)$!

DVC is also called Wyner-Ziv (WZ) video coding.

Video Coding in Cloud Computing

WZ to H.264 video transcoder

[Figure: WZ encoder (low complexity) → WZ encoded bitstream → WZ to H.264 transcoder (cloud computational resource) → H.264 encoded bitstream → H.264 decoder (low complexity)]

Motivation

There is still a rate-distortion performance gap between Wyner-Ziv video coding and conventional video coding (e.g. H.264/AVC).

Most reported WZ codecs have a high time delay in the decoder.

Trends of parallel computing, e.g. multi-core CPU, GPU.

Parallelizability of the decoder is essential.

DISPAC Video Codec

DIStributed video coding with PArallelized design for Cloud computing (DISPAC)

To improve rate-distortion (RD) performance: combine coding tools developed in the recent literature with some newly developed modules.

To reduce decoding time delay: highly parallelized decoder.

Outline



DISPAC Video Codec

Combines coding tools of two state-of-the-art WZ codecs:

DISCOVER codec (Distributed coding for video services): X. Artigas et al., “The DISCOVER codec: architecture, techniques and evaluation”, PCS, 2007.

MLWZ codec (Motion-learning based Wyner-Ziv video coding): R. Martin et al., “Statistical motion learning for improved transform domain Wyner-Ziv video coding”, IET Image Processing, 2010.

DISCOVER Video Codec

Ref. X. Artigas et al., PCS, 2007

[Figure: DISCOVER codec architecture]

GOP 2: Key, WZ, Key, WZ, Key

GOP 4: Key, WZ, WZ, WZ, Key

Quantization

Eight quantization matrices (each entry is the number of quantization levels of one 4x4 DCT coefficient band):

Q1:
16  8  0  0
 8  0  0  0
 0  0  0  0
 0  0  0  0

Q2:
32  8  0  0
 8  0  0  0
 0  0  0  0
 0  0  0  0

Q3:
32  8  4  0
 8  4  0  0
 4  0  0  0
 0  0  0  0

Q4:
32 16  8  4
16  8  4  0
 8  4  0  0
 4  0  0  0

Q5:
32 16  8  4
16  8  4  4
 8  4  4  0
 4  4  0  0

Q6:
64 16  8  8
16  8  8  4
 8  8  4  4
 8  4  4  0

Q7:
64 32 16  8
32 16  8  4
16  8  4  4
 8  4  4  0

Q8:
128 64 32 16
 64 32 16  8
 32 16  8  4
 16  8  4  0

32 = 2^5 => use 5 bits

8 = 2^3 => use 3 bits

0 => 0 bits (not transmitted)
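The mapping from quantization levels to bits can be written out directly. The following is a minimal sketch that takes the Q4 matrix from the slide and derives the number of bit planes per band as log2 of the number of levels; the function name is illustrative only.

```python
import math

# Quantization matrix Q4 from the slide: number of quantization levels per
# 4x4 DCT coefficient position (0 means the band is not transmitted).
Q4 = [
    [32, 16, 8, 4],
    [16, 8, 4, 0],
    [8, 4, 0, 0],
    [4, 0, 0, 0],
]

def bits_per_band(matrix):
    """Return the number of bit planes for each position (log2 of the levels)."""
    return [
        [int(math.log2(levels)) if levels > 0 else 0 for levels in row]
        for row in matrix
    ]

bit_planes = bits_per_band(Q4)
total_bit_planes = sum(sum(row) for row in bit_planes)

for row in bit_planes:
    print(row)
print("bit planes per 4x4 block position set:", total_bit_planes)
```

The total gives the number of bit planes that have to be channel coded per frame for this quantization matrix.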

Quantization

DCT coefficient band

Each 4x4 block $n$ holds 16 DCT coefficients $S^n_1, \dots, S^n_{16}$, indexed in zig-zag order:

S^n_1   S^n_2   S^n_6   S^n_7
S^n_3   S^n_5   S^n_8   S^n_13
S^n_4   S^n_9   S^n_12  S^n_14
S^n_10  S^n_11  S^n_15  S^n_16

Coefficients at the same position of all $N$ blocks (Block 1, Block 2, Block 3, ...) form one band:

$$b_1 = \{ S^1_1, S^2_1, S^3_1, \dots, S^N_1 \}$$

$$b_2 = \{ S^1_2, S^2_2, S^3_2, \dots, S^N_2 \}$$

$$\dots$$

$$b_{16} = \{ S^1_{16}, S^2_{16}, S^3_{16}, \dots, S^N_{16} \}$$

$b_1$ is the DC band; $b_2, \dots, b_{16}$ are the AC bands.
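The band grouping can be sketched in a few lines. The example below assumes the zig-zag index layout shown on the slide and uses two hypothetical 4x4 blocks with made-up coefficient values; it collects coefficient k of every block into band b_k.

```python
# ZIGZAG[r][c] is the 1-based band index of position (r, c) in a 4x4 block.
ZIGZAG = [
    [1, 2, 6, 7],
    [3, 5, 8, 13],
    [4, 9, 12, 14],
    [10, 11, 15, 16],
]

def group_into_bands(blocks):
    """Group the coefficients of N 4x4 blocks into 16 bands b_1 .. b_16."""
    bands = {k: [] for k in range(1, 17)}
    for block in blocks:
        for r in range(4):
            for c in range(4):
                bands[ZIGZAG[r][c]].append(block[r][c])
    return bands

# Two hypothetical blocks; block n holds the values r * 4 + c + 100 * n.
blocks = [
    [[r * 4 + c + 100 * n for c in range(4)] for r in range(4)]
    for n in range(2)
]
bands = group_into_bands(blocks)

print("DC band b_1:", bands[1])
print("AC band b_2:", bands[2])
```

Each band then has one entry per block, which is the unit that is quantized and bit-plane coded.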

Bit plane Extraction

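For each DCT coefficient band, the quantized symbols are split into bit planes, from MSB to LSB, and each bit plane is channel encoded (LDPCA).

Example: with Q4 the DC band has 32 = 2^5 levels, so it is sent as 5 bit planes (bit plane 1 = MSB, ..., bit plane 5 = LSB). Quantized symbols such as 4, 1, 0 and 30 are written as 00100, 00001, 00000 and 11110.

[Figure: example grid of quantized DC symbols (4, 6, 7, 0, 6, 3, 1, 7, 7, 30, 1, 5) next to the Q4 matrix]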
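Bit plane extraction is a simple shift-and-mask operation. The sketch below assumes 5-bit symbols (the DC band under Q4) and uses the four example symbols whose binary forms appear on the slide; the function name is illustrative.

```python
def extract_bit_planes(symbols, num_bits):
    """Split quantized symbols into bit planes, most significant bit first."""
    planes = []
    for k in range(num_bits - 1, -1, -1):  # k is the bit position, MSB first
        planes.append([(s >> k) & 1 for s in symbols])
    return planes

# 4 = 00100, 1 = 00001, 0 = 00000, 30 = 11110
symbols = [4, 1, 0, 30]
planes = extract_bit_planes(symbols, 5)

for index, plane in enumerate(planes, start=1):
    print(f"Bit plane {index}:", plane)
```

Each resulting list is what would be handed to the channel encoder as one bit plane.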


[Figure: Wyner-Ziv codec architecture, repeated from the introduction, with $R_Y \ge H(Y)$ and $R_X \ge H(X \mid Y)$]

Side Information Creation

[Figure: frames $X_B$ and $X_F$]

Low-pass filter (3x3 mean filter)

Divide the frame into 16x16 non-overlapping blocks

Motion estimation (search window: ±32)

$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_F(x, y) - X_B(x + d_x, y + d_y) \right|$$

$$CF(d_x, d_y) = MAD(d_x, d_y) \left( 1 + K \sqrt{d_x^2 + d_y^2} \right)$$
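The block matching step can be sketched as a full search that minimises the cost function CF, i.e. the MAD scaled by a penalty that grows with the length of the motion vector. In the sketch below, N is taken to be the number of pixels in the block, K = 0.05 is a placeholder value, and the tiny 8x8 test frames are synthetic; none of these come from the slides.

```python
import math

def mad(frame_f, frame_b, x0, y0, dx, dy, size):
    """Mean absolute difference between the block of X_F at (x0, y0)
    and the block of X_B displaced by (dx, dy)."""
    total = 0.0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            total += abs(frame_f[y][x] - frame_b[y + dy][x + dx])
    return total / (size * size)

def best_motion_vector(frame_f, frame_b, x0, y0, size, search, k=0.05):
    """Full search minimising CF = MAD * (1 + K * sqrt(dx^2 + dy^2))."""
    height, width = len(frame_b), len(frame_b[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= x0 + dx and x0 + dx + size <= width
                      and 0 <= y0 + dy and y0 + dy + size <= height)
            if not inside:
                continue  # skip candidates that leave the frame
            cost = mad(frame_f, frame_b, x0, y0, dx, dy, size)
            cost *= 1.0 + k * math.hypot(dx, dy)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]

# Synthetic frames: X_F is X_B shifted by (dx, dy) = (2, 1).
frame_b = [[3 * x * x + 5 * y * y + x * y for x in range(8)] for y in range(8)]
frame_f = [[frame_b[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)]
           for y in range(8)]

motion_vector = best_motion_vector(frame_f, frame_b, x0=2, y0=2, size=2, search=2)
print("estimated motion vector (dx, dy):", motion_vector)
```

The penalty term favours short motion vectors when several candidates have a similar MAD.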



[Figure: frames $X_B$ and $X_F$ with the points $(x_L, y_L)$, $(x_U, y_U)$, $(x_R, y_R)$, $(x_B, y_B)$ and the margin $N$]

Adaptive search range:

$$x_L - N \le d_x \le x_R + N$$

$$y_U - N \le d_y \le y_B + N$$

Half-pixel motion estimation

Spatial motion smoothing

Weighted vector median filter over the candidate motion vectors $x_1, \dots, x_9$:

$$x_{wvmf} = \arg\min_{x_i} \sum_{j=1,\, j \ne i}^{9} w_j \lVert x_i - x_j \rVert, \quad 1 \le i \le 9$$

$$w_j = \frac{MSE(x_i, B)}{MSE(x_j, B)}$$

Worked example for $i = 1$:

$$\sum_{j=1,\, j \ne 1}^{9} w_j \lVert x_1 - x_j \rVert = \frac{MSE_1}{MSE_2} \lVert x_1 - x_2 \rVert + \frac{MSE_1}{MSE_3} \lVert x_1 - x_3 \rVert + \dots + \frac{MSE_1}{MSE_9} \lVert x_1 - x_9 \rVert$$

The same sum is evaluated for every candidate $x_i$. In the slide's example the result for $x_6$ is the minimum, so $x_{wvmf} = x_6$ (the final motion vector).
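The weighted vector median filter can be sketched as a double loop over the nine candidates. The example assumes a Euclidean norm (the slides do not name the norm) and strictly positive MSE values; the candidate vectors and MSEs are made up for illustration.

```python
import math

def weighted_vector_median(vectors, mses):
    """Return the candidate x_i that minimises sum_{j != i} w_j * ||x_i - x_j||,
    with w_j = MSE_i / MSE_j as in the worked example."""
    best_index, best_cost = None, math.inf
    for i, (xi, mse_i) in enumerate(zip(vectors, mses)):
        cost = 0.0
        for j, (xj, mse_j) in enumerate(zip(vectors, mses)):
            if j == i:
                continue
            cost += (mse_i / mse_j) * math.dist(xi, xj)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return vectors[best_index]

# Hypothetical neighbourhood: eight consistent vectors and one outlier (x_5).
vectors = [(1, 1)] * 4 + [(10, 10)] + [(1, 1)] * 4
mses = [4.0] * 4 + [5.0] + [4.0] * 4

smoothed = weighted_vector_median(vectors, mses)
print("smoothed motion vector:", smoothed)
```

A candidate with a large matching error gets a large weight on its own distances, so it is less likely to be selected as the median.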

Bidirectional motion compensation: block interpolation ( 0.75*XB + 0.25*XF )

CNM Parameter Estimation

[Figure: frames $X_B$ and $X_F$]

Residual frame generation:

$$R(x, y) = \frac{X_F(x + d_{xf}, y + d_{yf}) - X_B(x + d_{xb}, y + d_{yb})}{2}$$

Residual frame DCT transform (4x4):

$$T_n(u, v) = DCT[R_n(x, y)]$$

[Figure: example 4x4 block of coefficients T (258, 10, -30, 120, 0.5, -6, 35, 5, -24, 200, -40, 20, ...); one variance is estimated per band: $\hat{\sigma}_1^2$ for band 1, $\hat{\sigma}_2^2$ for band 2, $\hat{\sigma}_3^2$ for band 3, ...]

CNM parameter computation:

$$\hat{\alpha}_n(u, v) = \begin{cases} \sqrt{\dfrac{2}{\hat{\sigma}_b^2}}, & [D_n(u, v)]^2 \le \hat{\sigma}_b^2 \\[2ex] \sqrt{\dfrac{2}{[D_n(u, v)]^2}}, & [D_n(u, v)]^2 > \hat{\sigma}_b^2 \end{cases}$$

$$D_n(u, v) = |T_n(u, v)| - E_b[|T_b|]$$

Example: assume the variance and mean of band 1 are $\hat{\sigma}_1^2 = 1000$ and $E_1[|T_1|] = 150$.

For $|T| = 258$: $D_n^2 = (|258| - 150)^2 = 108^2 > 1000$, so $\hat{\alpha} = \sqrt{2 / 108^2}$.

For $|T| = 120$: $D_n^2 = (|120| - 150)^2 = 30^2 \le 1000$, so $\hat{\alpha} = \sqrt{2 / 1000}$.
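The coefficient-level CNM rule can be sketched as a small function: the band variance is used unless the coefficient deviates from the band mean by more than one standard deviation, in which case the squared deviation takes its place. The inputs below are the example values as best recovered from the slide (band variance 1000, band mean 150, coefficients 258 and 120); the function name is illustrative.

```python
import math

def cnm_alpha(t, band_variance, band_mean_abs):
    """Laplacian parameter for one residual DCT coefficient t.

    D = |t| - band_mean_abs; alpha = sqrt(2 / variance) when D^2 <= variance,
    otherwise alpha = sqrt(2 / D^2).
    """
    d = abs(t) - band_mean_abs
    if d * d <= band_variance:
        return math.sqrt(2.0 / band_variance)
    return math.sqrt(2.0 / (d * d))

alpha_258 = cnm_alpha(258, band_variance=1000, band_mean_abs=150)
alpha_120 = cnm_alpha(120, band_variance=1000, band_mean_abs=150)

print(f"alpha for |T| = 258: {alpha_258:.5f}")
print(f"alpha for |T| = 120: {alpha_120:.5f}")
```

A smaller alpha means a wider Laplacian, so coefficients that deviate strongly from the band mean are treated as less reliable.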


Correlation Noise Distribution Modeling

[Figure: the correlation noise between the WZ frame and the side information is modelled as a Laplacian distribution whose parameter is the CNM parameter]

Conditional Bit Probability Computation

$P(B_k = 1 \mid Y, B^{k-1})$: probability that the k-th bit is one, given the side information (Y) and the previous k-1 decoded bits $B^{k-1}$.

[Figure: QCIF frame as a 176/4 x 144/4 grid of 4x4 WZ blocks; Laplacian pdf of X-Y for each block]

Example: assume the quantization step size is 32 and the candidate symbols run from 0011000 (24) to 0011111 (31). Then (31-24+1) x 32 = 256 probabilities need to be summed up.

R.P. Westerlaken et al., “Analyzing symbol and bit plane-based LDPC in distributed video coding”, ICIP, 2007.
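The count of 256 probabilities follows from the symbol range and the step size, and the conditional probability is the ratio of two such sums. The sketch below assumes unsigned 7-bit symbols, previously decoded bits 001, and a Laplacian pdf sampled at integer values of X; the values of y and alpha are placeholders, and the helper names are illustrative.

```python
import math

def laplacian_pdf(d, alpha):
    """Laplacian pdf of the correlation noise d = X - Y."""
    return 0.5 * alpha * math.exp(-alpha * abs(d))

def candidate_range(prev_bits, bit, num_bits, step):
    """Half-open range [lo, hi) of integer values whose quantized symbol
    starts with prev_bits followed by bit (MSB first)."""
    prefix = 0
    for b in list(prev_bits) + [bit]:
        prefix = (prefix << 1) | b
    free_bits = num_bits - len(prev_bits) - 1
    first_symbol = prefix << free_bits
    last_symbol = first_symbol + (1 << free_bits) - 1
    return first_symbol * step, (last_symbol + 1) * step

def conditional_bit_probability(prev_bits, num_bits, step, y, alpha):
    """P(next bit = 1 | side information y, previously decoded bits)."""
    sums = []
    for bit in (0, 1):
        lo, hi = candidate_range(prev_bits, bit, num_bits, step)
        sums.append(sum(laplacian_pdf(x - y, alpha) for x in range(lo, hi)))
    return sums[1] / (sums[0] + sums[1])

lo, hi = candidate_range([0, 0, 1], 1, num_bits=7, step=32)
p_one = conditional_bit_probability([0, 0, 1], num_bits=7, step=32,
                                    y=900, alpha=0.05)

print("values to sum for bit = 1:", hi - lo)
print(f"P(bit = 1 | Y = 900, bits 001) = {p_one:.4f}")
```

Every bit of every coefficient needs such a sum, which is why this step is a natural target for parallelization.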

Reconstruction

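[Figure: channel decoding (LDPCA) yields the bit planes of each band, e.g. bit plane 1 of the DC band: 0 0 0 1; the bit planes are regrouped into quantized symbols and placed back into the blocks in zig-zag order]

Optimal reconstruction:

$$\hat{x}_{opt} = E[x \mid x \in [z_i, z_{i+1}), y] = \begin{cases} z_i + \dfrac{1}{\alpha} + \dfrac{\Delta}{1 - e^{\alpha \Delta}}, & y < z_i \\[2ex] y + \dfrac{\left(\gamma + \frac{1}{\alpha}\right) e^{-\alpha \gamma} - \left(\delta + \frac{1}{\alpha}\right) e^{-\alpha \delta}}{2 - \left(e^{-\alpha \gamma} + e^{-\alpha \delta}\right)}, & y \in [z_i, z_{i+1}) \\[2ex] z_{i+1} - \dfrac{1}{\alpha} - \dfrac{\Delta}{1 - e^{\alpha \Delta}}, & y \ge z_{i+1} \end{cases}$$

where $\gamma = y - z_i$, $\delta = z_{i+1} - y$, and $\Delta$ is the quantization step size. $\alpha$ is the model parameter related to the variance of the Laplacian distribution as $\sigma^2 = 2 / \alpha^2$.

D. Kubasov et al., “Optimal reconstruction in Wyner–Ziv video coding with multiple side information”, IEEE workshop on MMSP, 2007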
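The three-case reconstruction rule translates directly into code. The sketch below takes the decoded quantization interval [z_lo, z_hi), the side information y and the Laplacian parameter alpha; the interval [0, 32) and alpha = 0.1 are placeholder values, and the function name is illustrative.

```python
import math

def reconstruct(y, z_lo, z_hi, alpha):
    """Expected value of x given x in [z_lo, z_hi) and side information y,
    for Laplacian correlation noise with parameter alpha."""
    step = z_hi - z_lo  # quantization step size (Delta)
    if y < z_lo:
        return z_lo + 1.0 / alpha + step / (1.0 - math.exp(alpha * step))
    if y >= z_hi:
        return z_hi - 1.0 / alpha - step / (1.0 - math.exp(alpha * step))
    gamma = y - z_lo
    delta = z_hi - y
    numerator = ((gamma + 1.0 / alpha) * math.exp(-alpha * gamma)
                 - (delta + 1.0 / alpha) * math.exp(-alpha * delta))
    denominator = 2.0 - (math.exp(-alpha * gamma) + math.exp(-alpha * delta))
    return y + numerator / denominator

x_below = reconstruct(-100.0, 0.0, 32.0, alpha=0.1)   # y left of the interval
x_inside = reconstruct(16.0, 0.0, 32.0, alpha=0.1)    # y at the interval centre
x_above = reconstruct(100.0, 0.0, 32.0, alpha=0.1)    # y right of the interval

print(f"{x_below:.3f} {x_inside:.3f} {x_above:.3f}")
```

When y falls outside the decoded interval, the estimate is pulled towards the interval edge nearest to y instead of being clipped to it.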


Room for improvement (DISCOVER): poor RD performance for high-motion and large-GOP-size sequences

MLWZ Video Codec

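Ref. R. Martin et al., IET Image Processing, 2010

[Figure: a block of the WZ frame (R) is matched against the side information (Y) over a search range; each candidate receives a soft motion field (SMF) value, e.g. $SMF_1 = 0.1$, $SMF_2 = 0.02$, ..., $SMF_{81} = 0.1$]

Update SMF:

$$SMF_n(m_x, m_y) = P\{(m_x, m_y)\} = e^{-\mu\, SSE_n^b(m_x, m_y)}$$

Normalize SMF:

$$SMF_n(m_x, m_y) = \frac{SMF_n(m_x, m_y)}{\sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} SMF_n(m_x, m_y)}$$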

MLWZ Video Codec

Side information re-estimation:

$$Y_n^{ML}(u, v) = \sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} Y^{DCT}_{n,(m_x, m_y)}(u, v)\, SMF_n(m_x, m_y)$$

[Figure: the DCT blocks $Y^{DCT}$ of all SI candidates within the search range are combined into $Y^{ML}$]

Correlation noise distribution modeling:

$$p_n^{ML}\left(X_n^{DCT}(u, v) - Y_n^{DCT}(u, v)\right) = \sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} SMF_n(m_x, m_y)\, \frac{\hat{\alpha}_n(u, v)}{2}\, e^{-\hat{\alpha}_n(u, v) \left| X_n^{DCT}(u, v) - Y^{DCT}_{n,(m_x, m_y)}(u, v) \right|}$$

Here $X^{DCT}$ is the DCT coefficient of the WZ frame, $Y^{DCT}$ is the DCT coefficient of the SI, and $\hat{\alpha}$ is the Laplacian parameter. The model is a sum of Laplacian distributions.

MLWZ Video Codec

Improves RD performance for high-motion and large-GOP-size sequences.

[Figure: MLWZ architecture with the parts that leave room for improvement marked]
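The soft motion field and the re-estimated side information can be sketched for a single coefficient. In the sketch below the scaling constant mu, the three motion candidates, their SSE values and the candidate coefficient values are all illustrative assumptions; only the structure (exponential of the matching error, normalisation, weighted sum) follows the slides.

```python
import math

def soft_motion_field(sse, mu=0.01):
    """Turn per-candidate matching errors into normalised weights.

    sse maps a motion candidate (mx, my) to its sum of squared errors.
    """
    weights = {mv: math.exp(-mu * value) for mv, value in sse.items()}
    total = sum(weights.values())
    return {mv: w / total for mv, w in weights.items()}

def reestimate_side_information(candidates, smf):
    """Weighted sum of the candidates' DCT coefficient values."""
    return sum(smf[mv] * candidates[mv] for mv in smf)

# Hypothetical matching errors for three motion candidates.
sse = {(0, 0): 100.0, (1, 0): 0.0, (0, 1): 200.0}
smf = soft_motion_field(sse)

# Hypothetical DCT coefficient value of the SI block under each candidate.
candidates = {(0, 0): 40.0, (1, 0): 50.0, (0, 1): 10.0}
y_ml = reestimate_side_information(candidates, smf)

print({mv: round(w, 3) for mv, w in smf.items()})
print(f"re-estimated coefficient: {y_ml:.2f}")
```

The candidate with the smallest error dominates, but the others still contribute, which is what distinguishes the soft field from a single hard motion vector.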

DISPAC Video Codec

[Figure: DISPAC codec architecture; modules annotated with their contributors: 邱柏叡, 白育姍, 程瀚平]

Callouts on the figure:

Half-pixel motion estimation [two block-matching sums over block B, between frame R and frames P and F; not recoverable from the extraction]

Reduce decoding time and improve RD performance

Improve subjective quality

Improve SI for motion learning

For low-motion parts

For high-motion parts

Improve initial SI and motion learning

Outline



RD Performance of DISPAC

Test sequences: QCIF, 15 Hz, all frames (150 for Soccer, Foreman, Coastguard and 164 for Hall Monitor)

Motion, from high to low: Soccer, Foreman, Coastguard, Hall Monitor

GOP size: 2, 4, 8

Bitrate and PSNR: only luminance component

RD Performance (GOP=2)

[Figures: RD curves of the test sequences (three slides); PSNR gains annotated on the last slide: 3.6 dB, 3.1 dB, 0.9 dB, 2.6 dB, 3.1 dB, 1.6 dB, 0.2 dB, 2.6 dB]

Outline



Parallelizing DISPAC Decoder

[Figure: DISPAC decoder architecture with the modules parallelized using OpenMP and CUDA marked]

Side Information Re-Creation

Assume a QCIF sequence, 800 4x4 WZ blocks, and 1024 search candidates within the search range.

[Figure: the candidates are processed in iterations of 128 (first iteration: 128 candidates, second iteration: 128 candidates, ...); label: texture memory]

Side Information Re-Creation

Reduction algorithm: Mark Harris, “Optimizing parallel reduction in CUDA”, NVIDIA Developer Technology, 2007.

[Equations: two block-matching sums over block B, one between frame R and frame B and one between frame R and frame F; not recoverable from the extraction]
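The access pattern of a tree reduction can be illustrated without a GPU. The sketch below runs sequentially in plain Python and only mimics the "sequential addressing" pattern, where each pass halves the number of active threads. Using the reduction to pick the candidate with the minimum cost is an assumption; the slides do not say which quantity is reduced, and the cost values are made up.

```python
def tree_reduce_min(costs):
    """Find (minimum cost, index) with a tree reduction.

    Each pass of the while loop corresponds to one synchronised step on a GPU;
    each iteration of the inner for loop corresponds to one thread.
    """
    pairs = [(cost, index) for index, cost in enumerate(costs)]
    size = 1
    while size < len(pairs):
        size *= 2
    # Pad to a power of two with neutral elements.
    pairs += [(float("inf"), -1)] * (size - len(pairs))

    stride = size // 2
    while stride > 0:
        for tid in range(stride):
            pairs[tid] = min(pairs[tid], pairs[tid + stride])
        stride //= 2
    return pairs[0]

costs = [7.0, 3.5, 9.0, 1.25, 8.0]
best_cost, best_index = tree_reduce_min(costs)
print("best candidate:", best_index, "cost:", best_cost)
```

With n candidates the loop needs about log2(n) passes, which is the property that makes the pattern attractive when 128 or more candidates are evaluated at once.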


Parallelizing DISPAC Decoder

[Figure: DISPAC decoder architecture with the modules parallelized using CUDA marked]


Assume a QCIF sequence, 800 4x4 WZ blocks, and 1024 possible integer values of X-Y for DCT coefficient band 2.

[Figure: QCIF frame as a 176/4 x 144/4 grid of 4x4 blocks in WZ, Skip or Intra mode; for each WZ block, $P_{CNM}$ (a sum of Laplacian pdfs) is evaluated at the 1024 integer values of X-Y]


$P(B_k = 1 \mid Y, B^{k-1})$: probability that the k-th bit is one, given the side information (Y) and the previous k-1 decoded bits.

[Figure: same block grid (176/4 x 144/4; WZ, Skip and Intra blocks); $P_{CNM}$ is a sum of Laplacian pdfs over X-Y]

Assume the quantization step size is 32 and the candidate symbols run from 0011000 (24) to 0011111 (31): (31-24+1) x 32 = 256 probabilities need to be summed up.

R.P. Westerlaken et al., “Analyzing symbol and bit plane-based LDPC in distributed video coding”, ICIP, 2007.

Outline



Decoding speed of DISPAC

A workstation equipped with an Intel Xeon E5530 CPU at 2.4 GHz and an NVIDIA Tesla C1060 graphics card is used to emulate the basic unit of a cloud computing environment.

Operating system: Debian squeeze/sid with 2.6.32-5-amd64 kernel.

Test conditions: QCIF, 15 Hz, whole sequence, GOP size 8, quantization table 8 (Q8)

Decoding speed of DISPAC

[Figure: bottleneck analysis (sequential decoding) for Foreman, Soccer, Coastguard and Hall Monitor; modules: LDPCA Decode, CNM (Correlation Noise Modeling), SI Re-Creation, Others]

Speedup ratio of decoding modules (8core+GPU):

Module           Foreman   Soccer   Coastguard   Hall Monitor
LDPCA Decode       22.81    16.64        27.77           9.21
CNM               232.06   179.95       293.17         184.08
SI Re-Creation    120.27   115.51       126.88         104.39
Others              8.75     8.43         9.06           8.02

Average decoding time per frame (sec.):

Sequence       DISCOVER    MLWZ   DISPAC (sequential)   DISPAC (8core+GPU)
Foreman           84.78   75.3                  48.35                 1.54
Soccer            81.31   84.38                 29.83                 1.33
Coastguard        62.31   74.72                 77.95                 1.9
Hall Monitor      13.78   33.18                 15.93                 1.19

Speed-up ratio (compared to DISCOVER):

Sequence       DISCOVER    MLWZ   DISPAC (sequential)   DISPAC (8core+GPU)
Foreman            1.00    1.13                  1.75                55.12
Soccer             1.00    0.96                  2.73                60.97
Coastguard         1.00    0.83                  0.80                32.79
Hall Monitor       1.00    0.42                  0.87                11.57
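The speed-up ratios are the DISCOVER decoding time divided by each codec's decoding time. The sketch below recomputes them from the per-frame times on the slide, reading the four values per sequence as DISCOVER, MLWZ, DISPAC (sequential) and DISPAC (8core+GPU). Because the slide shows rounded times, the recomputed parallel ratios differ slightly from the plotted ones (for example about 55.05 instead of 55.12 for Foreman).

```python
# Average decoding time per frame in seconds, as shown on the slide.
times = {
    "Foreman": {"DISCOVER": 84.78, "MLWZ": 75.3,
                "DISPAC (sequential)": 48.35, "DISPAC (8core+GPU)": 1.54},
    "Soccer": {"DISCOVER": 81.31, "MLWZ": 84.38,
               "DISPAC (sequential)": 29.83, "DISPAC (8core+GPU)": 1.33},
    "Coastguard": {"DISCOVER": 62.31, "MLWZ": 74.72,
                   "DISPAC (sequential)": 77.95, "DISPAC (8core+GPU)": 1.9},
    "Hall Monitor": {"DISCOVER": 13.78, "MLWZ": 33.18,
                     "DISPAC (sequential)": 15.93, "DISPAC (8core+GPU)": 1.19},
}

def speedups(row):
    """Speed-up of every codec relative to DISCOVER for one sequence."""
    baseline = row["DISCOVER"]
    return {codec: baseline / seconds for codec, seconds in row.items()}

speedup = {sequence: speedups(row) for sequence, row in times.items()}

for sequence, row in speedup.items():
    print(sequence, {codec: round(value, 2) for codec, value in row.items()})
```

The sequential DISPAC decoder is not always faster than DISCOVER (Coastguard, Hall Monitor); the large gains come from the parallel version.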

Outline



Conclusions

DISPAC combines coding tools developed in the recent literature (e.g. the MLWZ codec) with some newly developed modules (block mode selection, SI re-creation and adaptive deblocking filter): up to 3.6 dB gain in RD performance.

The decoding modules can be highly parallelized: up to 61 times faster than a state-of-the-art DVC codec (DISCOVER).

Future Work

Update the correlation noise model parameter during the decoding process (for RD performance).

Improve the parallelizability of the parallel LDPCA decoding algorithm for small parity-check matrices (for decoding speed).

WZ to H.264 video transcoder (for a real demo system).

Thank You
