High Efficient Distributed Video Coding with
Parallelized Design for Cloud Computing
適用於雲端架構下兼具高效能與平行化設計之分散式視訊編碼
CMLab, CSIE, NTU
Cheng, Han-Ping 程瀚平
Advisor: Prof. Wu, Ja-Ling 吳家麟 教授
2010/6/2
Outline
Introduction
DISPAC video codec
RD performance of DISPAC
Parallelizing DISPAC decoder
Decoding speed of DISPAC
Conclusions and future work
Trends of Cloud Computing
Cloud computing makes clients slimmer and thinner
Video Coding in Cloud Computing
Only a low-complexity encoder and decoder are needed at the client side
Conventional video coding (e.g. H.264)
  Encode once, decode many times
  Low-complexity decoder
Distributed Video Coding (DVC)
  e.g. video surveillance, wireless sensor network
  Low-complexity encoder
Distributed Video Coding
Slepian-Wolf Theorem (1973)
Wyner-Ziv Theorem (1976)
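[Figure: two coding paradigms. Top: Source X and Source Y go through separate encoders (Encoder X, Encoder Y) to a joint decoder; the dependency between X and Y exists but is not exploited by the encoders. Bottom: the conventional video coding paradigm, where a joint encoder exploits the statistical dependency and a joint decoder recovers X and Y.]
Independent coding: RX ≥ H(X), RY ≥ H(Y)
Joint encoding (conventional video coding paradigm): RX + RY ≥ H(X, Y)
Separate encoding with joint decoding: RX + RY ≥ ?
Slepian & Wolf: H(X, Y)!!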
Distributed Video Coding
Wyner-Ziv Theorem (1976)
  Extends the Slepian-Wolf result to lossy coding
[Figure: Wyner-Ziv coding block diagram. Source X goes through a quantizer and a channel encoder (Encoder X). Source Y goes through its own source encoder and source decoder (Encoder Y) and feeds the side information estimation, which produces X'. X' is treated as X received over a virtual channel, and the channel decoder inside the joint decoder corrects it using the parity bits P. The dependency exists but is not exploited at the encoder.]
Channel coding (error control code) analogy: X is sent as X+P over a noisy channel, received as (X+P)', and decoded to X'.
RX + RY ≥ ?
Wyner & Ziv: H(X, Y)!
  RY ≥ H(Y)
  RX ≥ H(X|Y)
  Correlation is exploited
DVC is also called Wyner-Ziv (WZ) video coding
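The rate bounds on this slide can be checked numerically. The sketch below is only an illustration: the joint distribution of the two binary sources is a hypothetical example (X equals Y with probability 0.9), not data from the presentation. It compares the rate needed when the dependency is ignored, H(X) + H(Y), with the bound H(Y) + H(X|Y) = H(X, Y).

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf of two binary sources: X equals Y with probability 0.9.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

# Marginal distributions of X and Y.
p_x = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy(joint.values())
h_x_given_y = h_xy - h_y  # chain rule: H(X|Y) = H(X, Y) - H(Y)

print(f"Dependency ignored: H(X) + H(Y)   = {h_x + h_y:.3f} bits")
print(f"Dependency used:    H(Y) + H(X|Y) = {h_y + h_x_given_y:.3f} bits")
```

The stronger the correlation between X and Y, the larger the gap between the two printed values.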
Video Coding in Cloud Computing
WZ to H.264 video transcoder
[Figure: a WZ encoder (low complexity) sends a WZ encoded bitstream to a WZ to H.264 transcoder running on the cloud's computational resources; the transcoder sends an H.264 encoded bitstream to an H.264 decoder (low complexity).]
Motivation
There is still an RD performance gap between Wyner-Ziv video coding and conventional video coding (e.g. H.264/AVC)
Most reported WZ codecs have a high time-delay in the decoder
  Trends of parallel computing, e.g. multi-core CPU, GPU
  Parallelizability of the decoder is essential
DISPAC Video Codec
DIStributed video coding with PArallelized design for Cloud computing (DISPAC)
To improve rate-distortion (RD) performance
  Combine coding tools developed in the recent literature with some newly developed modules.
To reduce decoding time-delay
  Highly parallelized decoder.
DISPAC Video Codec
Combines coding tools of two state-of-the-art WZ codecs:
  DISCOVER codec (Distributed coding for video services)
    X. Artigas et al., “The DISCOVER codec: architecture, techniques and evaluation”, PCS, 2007
  MLWZ codec (Motion-learning based Wyner-Ziv video coding)
    R. Martin et al., “Statistical motion learning for improved transform domain Wyner-Ziv video coding”, IET Image Processing, 2010
DISCOVER Video Codec
Ref. X. Artigas et al., PCS, 2007
[Figure: DISCOVER codec architecture, with the arrangement of key frames and WZ frames for GOP 2 (Key, WZ, Key, WZ, Key) and GOP 4.]
Quantization
Eight quantization matrices (each entry is the number of quantization levels for that DCT coefficient band)

Q1              Q2              Q3              Q4
16  8  0  0     32  8  0  0     32  8  4  0     32 16  8  4
 8  0  0  0      8  0  0  0      8  4  0  0     16  8  4  0
 0  0  0  0      0  0  0  0      4  0  0  0      8  4  0  0
 0  0  0  0      0  0  0  0      0  0  0  0      4  0  0  0

Q5              Q6              Q7              Q8
32 16  8  4     64 16  8  8     64 32 16  8    128 64 32 16
16  8  4  4     16  8  8  4     32 16  8  4     64 32 16  8
 8  4  4  0      8  8  4  4     16  8  4  4     32 16  8  4
 4  4  0  0      8  4  4  0      8  4  4  0     16  8  4  0

32 = 2^5 => use 5 bits
8 = 2^3 => use 3 bits
0 => 0 bits (not transmitted)
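The "32 = 2^5 => use 5 bits" rule can be applied to a whole quantization matrix. The following is a minimal sketch using matrix Q4 from the slide; the function name `bits_per_band` is made up for this illustration.

```python
# Q4 from the slide: number of quantization levels per DCT band (0 = band not sent).
Q4 = [
    [32, 16, 8, 4],
    [16, 8, 4, 0],
    [8, 4, 0, 0],
    [4, 0, 0, 0],
]

def bits_per_band(matrix):
    """Number of bit planes per band: log2(levels) for power-of-two levels, 0 if skipped."""
    return [[v.bit_length() - 1 if v > 0 else 0 for v in row] for row in matrix]

bits = bits_per_band(Q4)
total_bits_per_block = sum(sum(row) for row in bits)

for row in bits:
    print(*row)
print("bits per 4x4 block:", total_bits_per_block)
```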
Quantization
DCT coefficient band
Each 4x4 block $n$ holds 16 DCT coefficients $S^n_1, \dots, S^n_{16}$, numbered in zigzag order (shown on the slide for Block 1, Block 2 and Block 3):

$S^n_1$     $S^n_2$     $S^n_6$     $S^n_7$
$S^n_3$     $S^n_5$     $S^n_8$     $S^n_{13}$
$S^n_4$     $S^n_9$     $S^n_{12}$  $S^n_{14}$
$S^n_{10}$  $S^n_{11}$  $S^n_{15}$  $S^n_{16}$

DCT coefficient band b1: $\{ S^1_1, S^2_1, S^3_1, \dots, S^N_1 \}$ (DC band)
DCT coefficient band b2: $\{ S^1_2, S^2_2, S^3_2, \dots, S^N_2 \}$
…
DCT coefficient band b16: $\{ S^1_{16}, S^2_{16}, S^3_{16}, \dots, S^N_{16} \}$
Bands b2 to b16 are the AC bands
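Grouping coefficients into bands amounts to reading the same zigzag position from every block. The sketch below assumes the zigzag numbering shown on the slide; the two sample blocks are hypothetical and simply store each coefficient's zigzag index so the grouping is easy to verify.

```python
# Zigzag scan of a 4x4 block as (row, col) pairs; entry k-1 is the position of band b_k.
ZIGZAG_4X4 = [
    (0, 0), (0, 1), (1, 0), (2, 0),
    (1, 1), (0, 2), (0, 3), (1, 2),
    (2, 1), (3, 0), (3, 1), (2, 2),
    (1, 3), (2, 3), (3, 2), (3, 3),
]

def group_into_bands(blocks):
    """blocks: list of 4x4 lists. Returns 16 lists; bands[k][n] is coefficient k+1 of block n."""
    return [[block[r][c] for block in blocks] for (r, c) in ZIGZAG_4X4]

# Hypothetical blocks: each entry equals its zigzag index (plus 100 for the second block).
grid = [[1, 2, 6, 7], [3, 5, 8, 13], [4, 9, 12, 14], [10, 11, 15, 16]]
blocks = [grid, [[v + 100 for v in row] for row in grid]]
bands = group_into_bands(blocks)

print("DC band b1:", bands[0])
print("AC band b16:", bands[15])
```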
Bit plane Extraction
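For each DCT coefficient band…
Example with Q4 (DC band: 32 levels, 5 bits per symbol)
[Figure: four quantized blocks whose first three coefficients are (4, 6, 7), (1, 7, 7), (0, 6, 3) and (30, 1, 5).]
Quantized DC symbols, MSB first: 4 = 00100, 1 = 00001, 0 = 00000, 30 = 11110
Bit planes of DC band:
Bit plane 1 (MSB): 0 0 0 1
Bit plane 2: 0 0 0 1
Bit plane 3: 1 0 0 1
Bit plane 4: 0 0 0 1
Bit plane 5 (LSB): 0 1 0 0
Channel encode (LDPCA)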
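Bit plane extraction for one band can be sketched as follows, using the DC symbols 4, 1, 0 and 30 (00100, 00001, 00000, 11110) from the slide. The function name is hypothetical; each resulting plane would then be passed to the LDPCA encoder, which is not modelled here.

```python
def extract_bit_planes(symbols, num_bits):
    """Return bit planes, MSB first; plane p holds bit (num_bits - 1 - p) of every symbol."""
    return [[(s >> shift) & 1 for s in symbols] for shift in range(num_bits - 1, -1, -1)]

dc_symbols = [4, 1, 0, 30]  # quantized DC symbols from the slide
planes = extract_bit_planes(dc_symbols, 5)

for i, plane in enumerate(planes, start=1):
    print(f"Bit plane {i}:", *plane)
```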
DISCOVER Video Codec
Ref. X. Artigas et al., PCS, 2007
[Figure: DISCOVER codec architecture, shown alongside the Wyner-Ziv coding diagram (quantizer, channel encoder, channel decoder, side information estimation, virtual channel) with RY ≥ H(Y) and RX ≥ H(X|Y).]
Side Information Creation
[Figure: reference frames $X_F$ and $X_B$]
Low-pass filter (3x3 mean filter)
Divide the frame into 16x16 non-overlapping blocks
Motion estimation (search window: ±32)
$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_F(x, y) - X_B(x + d_x, y + d_y) \right|$$
$$CF(d_x, d_y) = MAD(d_x, d_y) \left( 1 + K \sqrt{d_x^2 + d_y^2} \right)$$
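The motion estimation step can be sketched as a full search that minimizes the cost function CF. This is only an illustration under stated assumptions: the penalty is taken to be K times the motion vector length, the value K = 0.05 is assumed, the frames are tiny synthetic arrays, and the function names are hypothetical. The low-pass filtering step is omitted.

```python
import math

def mad(frame_f, frame_b, x0, y0, dx, dy, size):
    """Mean absolute difference between a block of frame_f and the displaced block of frame_b."""
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            total += abs(frame_f[y][x] - frame_b[y + dy][x + dx])
    return total / (size * size)

def best_motion_vector(frame_f, frame_b, x0, y0, size, search, k=0.05):
    """Full search minimizing CF = MAD * (1 + K * sqrt(dx^2 + dy^2)); K = 0.05 is assumed."""
    height, width = len(frame_b), len(frame_b[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose displaced block leaves the frame.
            inside_x = 0 <= x0 + dx and x0 + dx + size <= width
            inside_y = 0 <= y0 + dy and y0 + dy + size <= height
            if not (inside_x and inside_y):
                continue
            cost = mad(frame_f, frame_b, x0, y0, dx, dy, size) * (1 + k * math.hypot(dx, dy))
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]

# Synthetic 16x16 frames: frame_f is frame_b displaced by (dx, dy) = (2, 1).
frame_b = [[7 * x + 13 * y for x in range(16)] for y in range(16)]
frame_f = [[7 * (x + 2) + 13 * (y + 1) for x in range(16)] for y in range(16)]
mv = best_motion_vector(frame_f, frame_b, x0=4, y0=4, size=4, search=3)
print("motion vector:", mv)
```

The length penalty biases the search towards short motion vectors when several candidates match almost equally well.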
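Side Information Creation
[Figure: frames $X_F$ and $X_B$; the current block and the motion vectors of its neighbouring blocks: left $(x_L, y_L)$, upper $(x_U, y_U)$, right $(x_R, y_R)$ and bottom $(x_B, y_B)$, each extended by a margin $N$.]
Adaptive search range:
$$x_L - N < d_x < x_R + N$$
$$y_U - N < d_y < y_B + N$$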
Side Information Creation
[Figure: frames $X_F$ and $X_B$]
Half-pixel motion estimation
Side Information Creation
Spatial motion smoothing
[Figure: frames $X_F$ and $X_B$; the candidate motion vectors $x_1, \dots, x_9$ are tried in turn on the current block B, giving the matching errors $MSE_1$, $MSE_2$, …]
Weighted vector median filter:
$$x_{wvmf} = \arg\min_{x_i} \sum_{j=1,\, j \neq i}^{9} w_j \, \| x_i - x_j \|, \quad 1 \le i \le 9$$
$$w_j = \frac{MSE(x_i, B)}{MSE(x_j, B)}$$
Example for $i = 1$:
$$\sum_{j=1,\, j \neq 1}^{9} w_j \, \| x_1 - x_j \| = \frac{MSE_1}{MSE_2} \| x_1 - x_2 \| + \frac{MSE_1}{MSE_3} \| x_1 - x_3 \| + \dots + \frac{MSE_1}{MSE_9} \| x_1 - x_9 \|$$
The result for $x_6$ is the minimum, so $x_{wvmf} = x_6$ (final motion vector!)
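The weighted vector median filter picks, among the nine candidate motion vectors, the one with the smallest weighted sum of distances to the others, with weights w_j = MSE(x_i, B) / MSE(x_j, B). A minimal sketch is given below; the candidate vectors and MSE values are hypothetical, and all MSE values are assumed to be positive.

```python
import math

def weighted_vector_median(vectors, mse):
    """Index i minimizing sum over j != i of (mse[i] / mse[j]) * ||x_i - x_j||."""
    best_index, best_cost = None, math.inf
    for i, (xi, yi) in enumerate(vectors):
        cost = sum(
            (mse[i] / mse[j]) * math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(vectors)
            if j != i
        )
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical candidates: eight agreeing vectors and one outlier, equal matching errors.
vectors = [(2, 1)] * 4 + [(-5, 4)] + [(2, 1)] * 4
mse = [10.0] * 9
smoothed = vectors[weighted_vector_median(vectors, mse)]
print("smoothed motion vector:", smoothed)
```

With these weights a candidate that matches the current block well (small MSE(x_i, B)) gets a small cost, so the filter removes outliers while still favouring vectors that fit the block.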
Side Information Creation
[Figure: frames $X_F$ and $X_B$]
Bidirectional motion compensation
Block interpolation (0.75*XB + 0.25*XF)
CNM Parameter Estimation
[Figure: frames $X_F$ and $X_B$]
Residual frame generation:
$$R(x, y) = \frac{X_F(x + d_{xf}, y + d_{yf}) - X_B(x + d_{xb}, y + d_{yb})}{2}$$
CNM Parameter Estimation
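Residual frame DCT transform (4x4):
$$T_n(u, v) = DCT[R_n(x, y)]$$
[Figure: example coefficient values of $T$ in several 4x4 blocks (258, 10, -30, 120, 0.5, -6, 35, 5, -24, 200, -40, 20).]
Variance of $|T|$ for each band: $\hat{\sigma}_1^2$ (band 1), $\hat{\sigma}_2^2$ (band 2), $\hat{\sigma}_3^2$ (band 3), …
Distance between a coefficient and the mean of its band:
$$D_n(u, v) = |T_n(u, v)| - E[|T|_b]$$
CNM parameter computation:
$$\hat{\alpha}_n(u, v) = \begin{cases} \sqrt{\dfrac{2}{\hat{\sigma}_b^2}}, & [D_n(u, v)]^2 \le \hat{\sigma}_b^2 \\[2ex] \sqrt{\dfrac{2}{[D_n(u, v)]^2}}, & [D_n(u, v)]^2 > \hat{\sigma}_b^2 \end{cases}$$
Example: assume the variance and mean of band 1 are $\hat{\sigma}_1^2 = 1000$ and $E[|T|_1] = 150$.
  Coefficient 258: $D_n^2 = (|258| - 150)^2 = 108^2 > 1000$, so $\hat{\alpha} = \sqrt{2 / 108^2}$
  Coefficient 120: $D_n^2 = (|120| - 150)^2 = 30^2 \le 1000$, so $\hat{\alpha} = \sqrt{2 / 1000}$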
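The coefficient-level CNM rule uses the band variance when the coefficient lies close to the band mean, and the squared distance D^2 otherwise. The sketch below reproduces the slide's example (band variance 1000, band mean 150); the function name is hypothetical.

```python
import math

def cnm_alpha(t, band_mean_abs, band_variance):
    """Laplacian parameter for one coefficient: sqrt(2 / max(D^2, band variance))."""
    d = abs(t) - band_mean_abs
    # Taking the maximum selects the band variance when D^2 <= variance, and D^2 otherwise.
    return math.sqrt(2.0 / max(d * d, band_variance))

alpha_258 = cnm_alpha(258, band_mean_abs=150, band_variance=1000)  # D^2 = 108^2 > 1000
alpha_120 = cnm_alpha(120, band_mean_abs=150, band_variance=1000)  # D^2 = 30^2 <= 1000
print(alpha_258, alpha_120)
```

A smaller alpha means a wider Laplacian, that is, less confidence in the side information for that coefficient.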
Correlation Noise Distribution Modeling
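[Figure: Laplacian distribution of the correlation noise between the WZ frame and the side information; labels: CNM parameter, side information, Laplacian distribution, WZ.]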
Conditional Bit Prob Computation
$P(B_k = 1 \mid Y, B^{k-1})$: probability that the k-th bit is one, given the side information (Y) and the previous k-1 decoded bits ($B^{k-1}$)
[Figure: a frame of 176/4 x 144/4 WZ blocks; for each block, the Laplacian pdf over X-Y (vertical axis: Prob.) is summed over the values of X that are consistent with $B^{k-1}$ and $B_k = 1$.]
Example: the candidate symbols run from 0011000 (24) to 0011111 (31)
  Assume the quantization step size is 32
  (31-24+1) x 32 = 256, so 256 probabilities need to be summed up
R.P. Westerlaken et al., “Analyzing symbol and bit plane-based LDPC in distributed video coding”, ICIP, 2007.
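The soft input for the channel decoder can be sketched as a ratio of two sums of Laplacian probabilities, one for each value of the k-th bit. This is only an illustration under stated assumptions: a uniform quantizer whose symbol q covers the integers from q*step to (q+1)*step - 1, a hypothetical side information value and alpha, and hypothetical function names. With the prefix 001 of a 7-bit symbol and step size 32, the sum for bit value one runs over the 256 integers of symbols 24 to 31, as on the slide.

```python
import math

def laplacian_pmf(d, alpha):
    """Laplacian density (alpha / 2) * exp(-alpha * |d|) sampled at difference d."""
    return 0.5 * alpha * math.exp(-alpha * abs(d))

def prob_bit_is_one(y, prefix_bits, num_bits, step, alpha):
    """P(B_k = 1 | Y = y, prefix) from sums over the integer values of each candidate range."""
    k = len(prefix_bits) + 1
    free = num_bits - k  # bits still undecoded after bit k
    prefix = int(prefix_bits, 2) if prefix_bits else 0
    sums = []
    for bit in (0, 1):
        first = ((prefix << 1) | bit) << free     # smallest symbol with this prefix and bit
        last = first + (1 << free) - 1            # largest such symbol
        lo, hi = first * step, (last + 1) * step  # integer values x with lo <= x < hi
        sums.append(sum(laplacian_pmf(x - y, alpha) for x in range(lo, hi)))
    return sums[1] / (sums[0] + sums[1])

# Hypothetical side information value 900 and alpha 0.05.
p_one = prob_bit_is_one(y=900, prefix_bits="001", num_bits=7, step=32, alpha=0.05)
print(f"P(B_4 = 1 | Y, B^3) = {p_one:.4f}")
```

Every bit plane of every coefficient needs sums of this kind, which is why this module is a natural target for parallelization.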
Reconstruction
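Channel decode (LDPCA)
Bit planes of DC band:
Bit plane 1: 0 0 0 1
Bit plane 2: 0 0 0 1
Bit plane 3: 1 0 0 1
Bit plane 4: 0 0 0 1
Bit plane 5: 0 1 0 0
[Figure: the decoded bit planes are regrouped into quantized symbols (DC values 4, 1, 0, 30) and written back into the 4x4 blocks in zigzag order, giving blocks whose first three coefficients are (4, 6, 7), (1, 7, 7), (0, 6, 3) and (30, 1, 5).]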
Reconstruction
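$$\hat{x}_{opt} = E\left[ x \mid x \in [z_i, z_{i+1}),\, y \right] = \begin{cases} z_i + \dfrac{1}{\alpha} + \dfrac{\Delta}{1 - e^{\alpha \Delta}}, & y < z_i \\[2ex] y + \dfrac{\left( \gamma + \frac{1}{\alpha} \right) e^{-\alpha \gamma} - \left( \delta + \frac{1}{\alpha} \right) e^{-\alpha \delta}}{2 - \left( e^{-\alpha \gamma} + e^{-\alpha \delta} \right)}, & y \in [z_i, z_{i+1}) \\[2ex] z_{i+1} - \dfrac{1}{\alpha} - \dfrac{\Delta}{1 - e^{\alpha \Delta}}, & y \ge z_{i+1} \end{cases}$$
where $\gamma = y - z_i$, $\delta = z_{i+1} - y$, and $\Delta$ is the quantization step size.
$\alpha$ is the model parameter related to the variance $\sigma^2$ of the Laplacian distribution as $\sigma^2 = 2 / \alpha^2$.
D. Kubasov et al., “Optimal reconstruction in Wyner–Ziv video coding with multiple side information”, IEEE workshop on MMSP, 2007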
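The reconstruction rule is the conditional mean of a Laplacian centred on the side information y, restricted to the decoded quantization bin. The sketch below implements the three cases (y below, inside, or above the bin); the bin limits, alpha and the function name are hypothetical values chosen for illustration.

```python
import math

def reconstruct(y, z_lo, z_hi, alpha):
    """E[x | x in [z_lo, z_hi), y] when x - y is Laplacian with parameter alpha."""
    step = z_hi - z_lo  # quantization step size (Delta)
    tail = 1.0 / alpha + step / (1.0 - math.exp(alpha * step))
    if y < z_lo:
        return z_lo + tail
    if y >= z_hi:
        return z_hi - tail
    gamma, delta = y - z_lo, z_hi - y
    num = (gamma + 1.0 / alpha) * math.exp(-alpha * gamma) \
        - (delta + 1.0 / alpha) * math.exp(-alpha * delta)
    den = 2.0 - (math.exp(-alpha * gamma) + math.exp(-alpha * delta))
    return y + num / den

# Hypothetical bin [32, 64) and alpha = 0.1.
below = reconstruct(0.0, 32.0, 64.0, 0.1)     # side information below the bin
inside = reconstruct(48.0, 32.0, 64.0, 0.1)   # side information at the bin centre
above = reconstruct(100.0, 32.0, 64.0, 0.1)   # side information above the bin
print(below, inside, above)
```

When y falls outside the bin, the reconstructed value is pulled towards the bin edge nearest to y instead of being clipped to the edge.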
DISCOVER Video Codec
Ref. X. Artigas et al., PCS, 2007
Poor RD performance for high-motion sequences and large GOP sizes
Room for improvement
MLWZ Video Codec
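Ref. R. Martin et al., IET Image Processing, 2010
[Figure: a block of the WZ frame (R) is compared with the side information SI (Y) at every candidate motion vector within the search range; each of the 81 candidates gets an SMF value, e.g. SMF1 = 0.1, SMF2 = 0.02, …, SMF81 = 0.1.]
Update SMF:
$$SMF_n(m_x, m_y) = P\{(m_x, m_y)\} \propto e^{-\mu \, SSE_n^b(m_x, m_y)}$$
Normalize SMF:
$$SMF_n(m_x, m_y) = \frac{SMF_n(m_x, m_y)}{\sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} SMF_n(m_x, m_y)}$$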
MLWZ Video Codec
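Ref. R. Martin et al., IET Image Processing, 2010
[Figure: the DCT coefficients of the side information, $Y^{DCT}$, taken at every candidate position within the search range, are combined into the re-estimated side information $Y^{ML}$.]
Side information re-estimation:
$$Y_n^{ML}(u, v) = \sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} Y_{n, (m_x, m_y)}^{DCT}(u, v) \, SMF_n(m_x, m_y)$$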
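Motion learning can be sketched in two steps: turn the matching errors of the candidate motion vectors into normalized SMF weights, then re-estimate each side information coefficient as the SMF-weighted average of the candidates. This is a sketch under stated assumptions: the weights are taken to be exponential in the SSE with an assumed scaling constant `mu`, and the SSE values, candidate coefficients and function names are all hypothetical.

```python
import math

def normalized_smf(sse, mu=0.01):
    """sse: dict (mx, my) -> SSE. Returns weights exp(-mu * SSE), normalized to sum to 1."""
    raw = {mv: math.exp(-mu * e) for mv, e in sse.items()}
    total = sum(raw.values())
    return {mv: w / total for mv, w in raw.items()}

def reestimate_si(candidates, smf):
    """SMF-weighted average of the candidate SI coefficients Y_mv(u, v)."""
    return sum(candidates[mv] * smf[mv] for mv in smf)

# Hypothetical SSE per candidate motion vector and candidate SI coefficient values.
sse = {(0, 0): 100.0, (1, 0): 300.0, (0, 1): 300.0}
candidates = {(0, 0): 50.0, (1, 0): 80.0, (0, 1): 20.0}

smf = normalized_smf(sse)
y_ml = reestimate_si(candidates, smf)
print(smf, y_ml)
```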
MLWZ Video Codec
Ref. R. Martin et al., IET Image Processing, 2010
Correlation Noise Distribution Modeling:
$$p_n^{ML}\left( X_n^{DCT}(u, v) \mid Y_n^{DCT}(u, v) \right) = \sum_{m_x = -S}^{S} \sum_{m_y = -S}^{S} SMF_n(m_x, m_y) \, \frac{\hat{\alpha}_n(u, v)}{2} \, e^{-\hat{\alpha}_n(u, v) \left| X_n^{DCT}(u, v) - Y_{n, (m_x, m_y)}^{DCT}(u, v) \right|}$$
  $X_n^{DCT}(u, v)$: DCT coefficient of WZ
  $Y_{n, (m_x, m_y)}^{DCT}(u, v)$: DCT coefficient of SI
  $\hat{\alpha}_n(u, v)$: Laplacian parameter
  Each term is a Laplacian distribution, so the model is a sum of Laplacians!
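The "sum of Laplacians" model can be sketched as a mixture in which every candidate motion vector contributes one Laplacian centred on its side information coefficient, weighted by its SMF value. The weights, coefficient values and alpha below are hypothetical.

```python
import math

def mixture_pdf(x, candidates, smf, alpha):
    """Sum over motion vectors of SMF(mv) * (alpha / 2) * exp(-alpha * |x - Y_mv|)."""
    return sum(
        smf[mv] * 0.5 * alpha * math.exp(-alpha * abs(x - candidates[mv]))
        for mv in smf
    )

# Hypothetical SMF weights (summing to 1) and candidate SI coefficients.
smf = {(0, 0): 0.7, (1, 0): 0.2, (0, 1): 0.1}
candidates = {(0, 0): 40.0, (1, 0): 60.0, (0, 1): 55.0}
alpha = 0.2

peak = mixture_pdf(40.0, candidates, smf, alpha)
far = mixture_pdf(200.0, candidates, smf, alpha)
print(peak, far)
```

Because the SMF weights sum to one, the mixture is still a valid probability density.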
MLWZ Video Codec
Ref. R. Martin et al., IET Image Processing, 2010
Improves RD performance for high-motion sequences and large GOP sizes
Room for improvement
DISPAC Video Codec
[Figure: DISPAC codec architecture. Contributor labels on the figure: 邱柏叡, 白育姍.]
Annotations on the modules:
  Reduce decoding time and improve RD performance
  Improve subjective quality
  Improve SI for motion learning
  For low motion parts
  For high motion parts
  Improve initial SI and motion learning
Half-pixel motion estimation:
$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_R(x, y) - X_P(x + d_x, y + d_y) \right|$$
$$CF(d_x, d_y) = MAD(d_x, d_y) \left( 1 + K \sqrt{d_x^2 + d_y^2} \right)$$
$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_R(x, y) - X_F(x + d_x, y + d_y) \right|$$
DISPAC Video Codec
[Figure: DISPAC codec architecture, with modules labelled by contributor: 邱柏叡, 白育姍, 程瀚平.]
RD Performance of DISPAC
Test sequences: Soccer, Foreman, Coastguard, Hall Monitor (ordered from high motion to low motion)
  QCIF, 15 Hz, all frames (150 for Soccer, Foreman and Coastguard; 164 for Hall Monitor)
GOP size: 2, 4, 8
Bitrate and PSNR: luminance component only
RD Performance (GOP=2)
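[Figure: RD curves of the four test sequences, GOP size 2.]
RD Performance (GOP=4)
[Figure: RD curves of the four test sequences, GOP size 4.]
RD Performance (GOP=8)
[Figure: RD curves of the four test sequences, GOP size 8. Gain annotations on the plots: 3.6 dB, 3.1 dB, 0.9 dB, 2.6 dB, 3.1 dB, 1.6 dB, 0.2 dB, 2.6 dB.]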
Parallelizing DISPAC Decoder
[Figure: DISPAC decoder architecture, marking the modules parallelized with OpenMP and the modules parallelized with CUDA.]
Side Information Re-Creation
Assume QCIF sequence, 800 4x4 WZ blocks, 1024 search candidates within search range
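[Figure: CUDA mapping of the candidate search. The candidates are processed 128 at a time (first iteration: 128 candidates, second iteration: 128 candidates, …), with the frame data read through texture memory.]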
Side Information Re-Creation
Reduction algorithm
  Mark Harris, “Optimizing parallel reduction in CUDA”, NVIDIA Developer Technology, 2007.
$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_R(x, y) - X_B(x + d_x, y + d_y) \right|$$
$$CF(d_x, d_y) = MAD(d_x, d_y) \left( 1 + K \sqrt{d_x^2 + d_y^2} \right)$$
$$MAD(d_x, d_y) = \frac{1}{N} \sum_{(x, y) \in B} \left| X_R(x, y) - X_F(x + d_x, y + d_y) \right|$$
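Choosing the best of 1024 search candidates is a minimum reduction over their costs. The sketch below shows the tree-shaped reduction pattern in sequential Python only to illustrate the idea; it is not the CUDA kernel from the cited note, the cost values are synthetic, and the function name is hypothetical. On a GPU, each pass of the inner loop would be executed by parallel threads with a synchronization barrier between passes.

```python
def reduce_min(costs):
    """Tree reduction over a non-empty list; returns (min cost, index of that candidate)."""
    work = [(c, i) for i, c in enumerate(costs)]
    n = len(work)
    while n > 1:
        half = (n + 1) // 2
        # Each iteration of this loop is independent, so it maps to one GPU thread.
        for t in range(n // 2):
            work[t] = min(work[t], work[t + half])
        n = half  # the number of active entries halves on every pass
    return work[0]

# Synthetic costs for 1024 candidates; the smallest cost (1) belongs to candidate 0.
costs = [(i * 7919) % 1024 + 1 for i in range(1024)]
best_cost, best_candidate = reduce_min(costs)
print(best_cost, best_candidate)
```

For 1024 candidates the reduction finishes in 10 passes instead of 1023 sequential comparisons.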
Parallelizing DISPAC Decoder
[Figure: DISPAC decoder architecture, marking two further modules parallelized with CUDA.]
Correlation Noise Distribution Modeling
Assume QCIF sequence, 800 4x4 WZ blocks, 1024 possible integer values of X-Y for DCT coefficient band 2
[Figure: a frame of 176/4 x 144/4 blocks, each in Skip, Intra or WZ mode. For every WZ block, $P_{CNM}$ (a sum of Laplacian pdfs) is evaluated at the 1024 integer values of X-Y.]
Conditional Bit Prob Computation
$P(B_k = 1 \mid Y, B^{k-1})$: probability that the k-th bit is one, given the side information (Y) and the previous k-1 decoded bits ($B^{k-1}$)
[Figure: a frame of 176/4 x 144/4 blocks, each in Skip, Intra or WZ mode. For every WZ block, $P_{CNM}$ (a sum of Laplacian pdfs over X-Y) is summed over the values of X that are consistent with $B^{k-1}$ and $B_k = 1$.]
Example: the candidate symbols run from 0011000 (24) to 0011111 (31)
  Assume the quantization step size is 32
  (31-24+1) x 32 = 256, so 256 probabilities need to be summed up
R.P. Westerlaken et al., “Analyzing symbol and bit plane-based LDPC in distributed video coding”, ICIP, 2007.
Decoding speed of DISPAC
A workstation equipped with an Intel Xeon E5530 CPU at 2.4 GHz and an NVIDIA Tesla C1060 graphics card is used to emulate the basic unit of a cloud computing environment.
Operating system: Debian squeeze/sid with the 2.6.32-5-amd64 kernel.
Test conditions: QCIF, 15 Hz, whole sequence, GOP size 8, quantization table 8 (Q8)
Decoding speed of DISPAC
Bottleneck analysis (sequential decoding)
CNM: Correlation Noise Modeling
Decoding speed of DISPAC
Speedup ratio of decoding modules (8core+GPU)

Module            Foreman    Soccer    Coastguard    Hall Monitor
LDPCA Decode        22.81     16.64         27.77            9.21
CNM                232.06    179.95        293.17          184.08
SI Re-Creation     120.27    115.51        126.88          104.39
Others               8.75      8.43          9.06            8.02
Decoding speed of DISPAC
Average decoding time per frame (sec.)

Sequence        DISCOVER (sequential)    MLWZ (sequential)    DISPAC (sequential)    DISPAC (8core+GPU)
Foreman                         84.78                75.30                  48.35                  1.54
Soccer                          81.31                84.38                  29.83                  1.33
Coastguard                      62.31                74.72                  77.95                  1.90
Hall Monitor                    13.78                33.18                  15.93                  1.19
Decoding speed of DISPAC
Speed up ratio (compared to DISCOVER)

Sequence        DISCOVER (sequential)    MLWZ (sequential)    DISPAC (sequential)    DISPAC (8core+GPU)
Foreman                          1.00                 1.13                   1.75                 55.12
Soccer                           1.00                 0.96                   2.73                 60.97
Coastguard                       1.00                 0.83                   0.80                 32.79
Hall Monitor                     1.00                 0.42                   0.87                 11.57
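The speed-up ratios follow directly from the average decoding times: each ratio is the DISCOVER time divided by the time of the codec in question. The sketch below recomputes them from the per-frame times shown on the previous slide; the assignment of the numbers to codecs is read from the chart, and small differences from the slide's values (for example 55.05 instead of 55.12 for Foreman) come from the times being rounded to two decimals.

```python
# Average decoding time per frame in seconds, as read from the slide:
# (DISCOVER, MLWZ, DISPAC sequential, DISPAC 8 cores + GPU)
times = {
    "Foreman": (84.78, 75.30, 48.35, 1.54),
    "Soccer": (81.31, 84.38, 29.83, 1.33),
    "Coastguard": (62.31, 74.72, 77.95, 1.90),
    "Hall Monitor": (13.78, 33.18, 15.93, 1.19),
}

def speedups(row):
    """Speed-up of every codec relative to DISCOVER (the first entry)."""
    baseline = row[0]
    return [baseline / t for t in row]

for name, row in times.items():
    print(name, " ".join(f"{s:.2f}" for s in speedups(row)))
```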
Conclusions
DISPAC combines the coding tools developed in the recent literature (e.g. the MLWZ codec) with some newly developed modules (block mode selection, SI re-creation and adaptive deblocking filter).
  Up to 3.6 dB gain in RD performance
The decoding modules can be highly parallelized.
  Up to 61 times faster than the state-of-the-art DVC codec
Future Work
Update the correlation noise model parameter during the decoding process.
  For RD performance
Improve the parallelizability of the parallel LDPCA decoding algorithm for small parity-check matrices.
  For decoding speed
WZ to H.264 video transcoder.
  For a real demo system
Thank You