BACKWARD CHANNEL AWARE DISTRIBUTED VIDEO CODING
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Limin Liu
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
December 2007
Purdue University
West Lafayette, Indiana
To my parents, Binbin Lin and Guojun Liu;
To my grandfather, Changlian Liu;
To my husband, Zhen Li;
And in memory of my grandparents, Yuzhu Ruan and Decong Lin.
ACKNOWLEDGMENTS
I am very grateful to my advisor, Professor Edward J. Delp, for his invaluable
guidance and support, for his confidence in me, and for the precious opportunities
he has given me. I wish to express my sincere thanks for his inspiring instruction
and broad range of expertise, which have led me to the interesting and charming
world of video coding. It has been a great honor to be a part of the Video and Image
Processing (VIPER) lab.
I also thank my Doctoral Committee: Professor Zygmunt Pizlo, Professor Mark
Smith, and Professor Michael D. Zoltowski, for their advice, encouragement, and
insights despite their extremely busy schedules.
I would like to thank the Indiana Twenty-First Century Research and Technology
Fund for supporting the research.
I have been very fortunate to work with many incredibly nice and brilliant colleagues
in the VIPER lab, and I appreciate their support and friendship. Working with
them has been one of the joyful highlights of my graduate school life: Dr. Gregory
Cook, Dr. Hyung Cook Kim, Dr. Eugene Lin, Dr. Yuxin Liu, Dr. Paul Salama,
Dr. Yajie Sun, Dr. Cuneyt Taskiran, Dr. Hwayoung Um, Golnaz Abdollahian, Marc
Bosch, Ying Chen, Oriol Guitart, Michael Igarta, Deen King-Smith, Liang Liang,
Ashok Mariappan, Anthony Martone, Aravind Mikkilineni, Nitin Khanna, Ka Ki
Ng, Carlos Wang, and Fengqing Zhu. I would also like to thank our visiting re-
searchers from abroad for their perspectives: Professor Reginald Lagendijk, Professor
Fernando Pereira, Professor Luis Torres, Professor Josep Prades-Nebot, Pablo Sabria,
and Rafael Villoria.
I would like to thank Mr. Mike Deiss and Dr. Haoping Yu for offering me a summer
internship at Thomson Corporate Research in 2005. I am particularly grateful for
the opportunity to design and develop the advanced 4:4:4 scheme for H.264, which
was adopted by the Joint Video Team (JVT) and became a new profile in H.264.
I would like to thank Dr. Margaret (Meg) Withgott and Dr. Yuxin (Zoe) Liu for
the summer internship at Sun Microsystems Laboratories in 2006. Their guidance
and encouragement have been very beneficial. I also thank Dr. Gadiel Seroussi of
Mathematical Sciences Research Institute for the discussion on video compression
and general coding problems.
I would like to thank Dr. Vadim Sheinin, Dr. Ligang (Larry) Lu, Dr. Dake He,
Dr. Ashish Jagmohan, and Dr. Jun Chen for the opportunity to work at IBM T.
J. Watson Research Lab as a summer intern in 2007. I enjoyed the numerous and
passionate discussions with them. And I miss all my summer intern friends from IBM
Research Lab.
I would like to thank all my friends from Tsinghua University. Their friendship
along the journey made my undergraduate years a cherished memory.
I have dedicated this document to my mother and father, and in memory of my
grandparents for their constant support throughout these years. They taught me to
think positively and make persistent efforts. I would also like to thank my parents-in-law,
brothers-in-law and sisters-in-law. Their love allows me to experience the joy of a big
family.
Finally, I would like to express my sincere appreciation to my husband, Zhen Li,
for his patience, understanding, encouragement and love. I could never have gone
this far without every bit of his help.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview of Image and Video Coding Standards . . . . . . . . . . . 1
1.1.1 Image Coding Standards . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Video Coding Standards . . . . . . . . . . . . . . . . . . . . 4
1.2 Recent Advances in Video Compression . . . . . . . . . . . . . . . . 18
1.2.1 Distributed Video Coding . . . . . . . . . . . . . . . . . . . 19
1.2.2 Scalable Video Coding . . . . . . . . . . . . . . . . . . . . . 27
1.2.3 Multi-View Video Coding . . . . . . . . . . . . . . . . . . . 29
1.3 Overview of The Thesis . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3.1 Contributions of The Thesis . . . . . . . . . . . . . . . . . . 32
1.3.2 Organization of The Thesis . . . . . . . . . . . . . . . . . . 34
2 WYNER-ZIV VIDEO CODING . . . . . . . . . . . . . . . . . . . . . . . 35
2.1 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . 35
2.1.1 Slepian-Wolf Coding Theorem for Lossless Compression . . . 36
2.1.2 Wyner-Ziv Coding Theorem for Lossy Compression . . . . . 37
2.2 Wyner-Ziv Video Coding Testbed . . . . . . . . . . . . . . . . . . . 38
2.2.1 Overall Structure of Wyner-Ziv Video Coding . . . . . . . . 41
2.2.2 Channel Codes (Turbo Codes and LDPC Codes) . . . . . . 43
2.2.3 Derivation of Side Information . . . . . . . . . . . . . . . . . 47
2.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Rate Distortion Analysis of Motion Side Estimation in Wyner-Ziv Video Coding . . . . 52
2.4 Wyner-Ziv Video Coding with Universal Prediction . . . . . . . . . 59
3 BACKWARD CHANNEL AWARE WYNER-ZIV VIDEO CODING . . . . 64
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Backward Channel Aware Motion Estimation . . . . . . . . . . . . 66
3.3 Backward Channel Aware Wyner-Ziv Video Coding . . . . . . . . . 68
3.3.1 Mode Choices in BCAME . . . . . . . . . . . . . . . . . . . 70
3.3.2 Wyner-Ziv Video Coding with BCAME . . . . . . . . . . . . 72
3.4 Error Resilience in the Backward Channel . . . . . . . . . . . . . . 73
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 COMPLEXITY-RATE-DISTORTION ANALYSIS OF BACKWARD CHANNEL AWARE WYNER-ZIV VIDEO CODING . . . . 83
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Overview of Complexity-Rate-Distortion Analysis in Video Coding . 84
4.2.1 Power-Rate-Distortion Analysis for Wireless Video Communication . . . . 84
4.2.2 H.264 Encoder and Decoder Complexity Analysis . . . . . . 86
4.2.3 Complexity Scalable Motion Compensated Wavelet Video Encoding . . . . 87
4.3 Backward Channel Aware Wyner-Ziv Video Coding . . . . . . . . . 88
4.4 Complexity-Rate-Distortion Analysis of BP Frames . . . . . . . . . 90
4.4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 The Minimum Motion Estimator . . . . . . . . . . . . . . . 91
4.4.3 The Median Motion Estimator . . . . . . . . . . . . . . . . . 97
4.4.4 The Average Motion Estimator . . . . . . . . . . . . . . . . 101
4.4.5 Comparisons of Minimum, Median and Average Motion Estimators . . . . 102
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Contributions of The Thesis . . . . . . . . . . . . . . . . . . . . . . 107
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
LIST OF TABLES
Table Page
1.1 Comparison of the H.261 and MPEG-1 standards . . . . . . . . . . . . 5
2.1 Generator matrix of the RSC encoders . . . . . . . . . . . . . . . . . . 44
4.1 The variance of the error motion vectors for the minimum and median motion estimators (ν = 0, σ² = 1) . . . . 98
4.2 Comparisons of the variance of the error motion vectors for the minimum, median and average motion estimators . . . . 102
LIST OF FIGURES
Figure Page
1.1 Block Diagram of a DCT-Based JPEG Coder . . . . . . . . . . . . . . 2
1.2 Discrete Wavelet Transform of Image Tile Components . . . . . . . . . 3
1.3 A Hybrid Motion-Compensated-Prediction Based Video Coder (H.264) 8
1.4 Subdivision of a Picture into Slices . . . . . . . . . . . . . . . . . . . . 9
1.5 (a) INTRA 4 × 4 Prediction (b) Eight “Prediction Directions” . . . . . 11
1.6 Five of the Nine INTRA 4 × 4 Prediction Modes . . . . . . . . . . . . 12
1.7 Segmentation of the Macroblock for Motion Compensation . . . . . . . 14
1.8 An Example of the Segmentation of One Macroblock . . . . . . . . . . 14
1.9 Filtering for Fractional-Sample Accurate Motion Compensation . . . . 16
1.10 Multiframe Motion Compensation . . . . . . . . . . . . . . . . . . . . . 17
1.11 Side Information in DISCUS . . . . . . . . . . . . . . . . . . . . . . . 21
1.12 Block Diagram of the PRISM Encoder . . . . . . . . . . . . . . . . . . 23
1.13 Block Diagram of the PRISM Decoder . . . . . . . . . . . . . . . . . . 23
1.14 Systematic Lossy Error Protection (SLEP) by Combining Hybrid Video Coding and RS Codes . . . . 25
1.15 Block Diagram of the Layered Wyner-Ziv Video Codec . . . . . . . . . 26
1.16 Hierarchical Structure of Temporal Scalability . . . . . . . . . . . . . . 28
1.17 Uli Sequences (Cameras 0, 2, 4) . . . . . . . . . . . . . . . . . . . . . . 30
1.18 Inter-View/Temporal Prediction Structure . . . . . . . . . . . . . . . . 32
2.1 Correlation Source Coding Diagram . . . . . . . . . . . . . . . . . . . . 36
2.2 Admissible Rate Region for the Slepian-Wolf Theorem . . . . . . . . . 37
2.3 Wyner-Ziv Coding with Side Information at the Decoder . . . . . . . . 39
2.4 Example of Side Information . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Wyner-Ziv Video Coding Structure . . . . . . . . . . . . . . . . . . . . 42
2.6 An Example of GOP in Wyner-Ziv Video Coding . . . . . . . . . . . . 42
2.7 Structure of Turbo Encoder Used in Wyner-Ziv Video Coding . . . . . 44
2.8 Example of a Recursive Systematic Convolutional (RSC) Code . . . . 45
2.9 Structure of Turbo Decoder Used in Wyner-Ziv Video Coding . . . . . 46
2.10 Tanner Graph of a (7,4) LDPC Code . . . . . . . . . . . . . . . . . . . 47
2.11 Derivation of Side Information by Extrapolation . . . . . . . . . . . . . 48
2.12 Derivation of Side Information by Interpolation . . . . . . . . . . . . . 49
2.13 Refined Side Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.14 WZVC Testbed: R-D Performance Comparison (Foreman QCIF) . . . 52
2.15 WZVC Testbed: R-D Performance Comparison (Coastguard QCIF) . . 53
2.16 WZVC Testbed: R-D Performance Comparison (Carphone QCIF) . . 54
2.17 WZVC Testbed: R-D Performance Comparison (Silent QCIF) . . . . . 55
2.18 WZVC Testbed: R-D Performance Comparison (Stefan QCIF) . . . . 56
2.19 WZVC Testbed: R-D Performance Comparison (Table Tennis QCIF) . 57
2.20 Wyner-Ziv Video Coding with Different Motion Search Accuracies (Foreman QCIF) . . . . 58
2.21 Wyner-Ziv Video Coding with Multi-reference Motion Search (Foreman QCIF) . . . . 59
2.22 Universal Prediction Side Estimator Context . . . . . . . . . . . . . . . 61
2.23 Side Estimator by Universal Prediction . . . . . . . . . . . . . . . . . . 63
3.1 Adaptive Coding for Network-Driven Motion Estimation (NDME) . . . 67
3.2 Network-Driven Motion Estimation (NDME) . . . . . . . . . . . . . . . 68
3.3 Mode I: Forward Motion Vector for BCAME . . . . . . . . . . . . . . . 71
3.4 Mode II: Backward Motion Vector for BCAME . . . . . . . . . . . . . 71
3.5 Backward Channel Aware Wyner-Ziv Video Coding . . . . . . . . . . . 73
3.6 BCAWZ: R-D Performance Comparison (Foreman QCIF) . . . . . . . 76
3.7 BCAWZ: R-D Performance Comparison (Coastguard QCIF) . . . . . . 77
3.8 BCAWZ: R-D Performance Comparison (Carphone QCIF) . . . . . . . 78
3.9 BCAWZ: R-D Performance Comparison (Mobile QCIF) . . . . . . . . 79
3.10 Comparisons of BCAWZ and WZ with INTRA Key Frames at 511 KBits/Second (Foreman CIF) . . . . 79
3.11 Backward Channel Usage in BCAWZ . . . . . . . . . . . . . . . . . . 80
3.12 R-D Performance with Error Resilience (Foreman QCIF) (Motion Vector of the 254th Frame is Delayed by Two Frames) . . . . 81
3.13 R-D Performance with Error Resilience (Coastguard QCIF) (Motion Vector of the 200th Frame is Lost) . . . . 82
4.1 Backward Channel Aware Wyner-Ziv Video Coding . . . . . . . . . . . 89
4.2 The Probability Density Function of Z(1) . . . . . . . . . . . . . . . . 93
4.3 Rate Difference of the Minimum Motion Estimator (δx = δy = √2/2, ν = 1) . . . . 94
4.4 Rate Difference of the Minimum Motion Estimator (δx = δy = √2/4, ν = 1/2) . . . . 95
4.5 Rate Difference of the Minimum Motion Estimator (δx = δy = 0, ν = 0) 96
4.6 Rate Difference of the Median Motion Estimator (δx = δy = √2/2, ν = 1) . . . . 99
4.7 Rate Difference of the Median Motion Estimator (δx = δy = √2/4, ν = 1/2) . . . . 99
4.8 Rate Difference of the Median Motion Estimator (δx = δy = 0, ν = 0) . 100
4.9 Rate Difference of the Average Motion Estimator (δx = δy = √2/2, ν = 1) . . . . 103
4.10 Rate Difference of the Average Motion Estimator (δx = δy = √2/4, ν = 1/2) . . . . 103
4.11 Rate Difference of the Average Motion Estimator (δx = δy = 0, ν = 0) 104
4.12 Comparisons of the Minimum, Median and Average Motion Estimators 106
ABBREVIATIONS
AVC Advanced Video Coding
BCAME Backward Channel Aware Motion Estimation
BCAWZ Backward Channel Aware Wyner-Ziv Video Coding
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable-Length Coding
CCITT International Telegraph and Telephone Consultative Committee
DCT Discrete Cosine Transform
DISCUS DIstributed Source Coding Using Syndromes
DVC Distributed Video Coding
DWT Discrete Wavelet Transform
EBCOT Embedded Block Coding with Optimized Truncation
FIR Finite Impulse Response
GOP Groups Of Pictures
ISDN Integrated Services Digital Network
ISO International Organization for Standardization
ITU International Telecommunication Union
JVT Joint Video Team
KLT Karhunen-Loève Transform
LDPC Low Density Parity Check
Mbps Mbits/sec
MC Motion Compensation
MCP Motion-Compensated Prediction
ME Motion Estimation
MPEG Moving Picture Experts Group
NDME Network-Driven Motion Estimation
NSQ Nested Scalar Quantization
OBMC Overlapped Block Motion Compensation
PCCC Parallel Concatenated Convolutional Code
PRISM Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia coding
PSNR Peak Signal-to-Noise Ratio
QP Quantization Parameter
RSC Recursive Systematic Convolutional
RVLC Reversible Variable Length Code
SLEP Systematic Lossy Error Protection
SWC Slepian-Wolf Coding
TCSQ Trellis-Coded Scalar Quantization
VCEG Video Coding Experts Group
VLC Variable-Length Code
WZVC Wyner-Ziv Video Coding
ABSTRACT
Liu, Limin Ph.D., Purdue University, December 2007. Backward Channel Aware Distributed Video Coding. Major Professor: Edward J. Delp.
Digital image and video coding has witnessed rapid development in the past
decades. Conventional hybrid motion-compensated-prediction (MCP) based video
coding exploits both spatial and temporal redundancy at the encoder. Hence the
encoder requires much more computational resources than the decoder. This poses
a challenge for applications such as video surveillance systems and wireless sensor
networks. Only limited memory and power are available at the encoder for these
applications, while the decoder has access to more powerful computational resources.
The Slepian-Wolf theorem and Wyner-Ziv theorem have proved that a distributed
video coding scheme is achievable where sources are encoded separately and decoded
jointly. The basic goal of our research is to analyze the performance of low-complexity
video encoding theoretically, and to design new practical techniques to achieve
high video coding efficiency while maintaining low encoding complexity.
In this thesis, we propose a new backward channel aware Wyner-Ziv approach.
The basic idea is to use backward channel aware motion estimation to code the key
frames in Wyner-Ziv video coding, where motion estimation is done at the decoder and
motion vectors are sent back to the encoder. We refer to these backward predictive
coded frames as BP frames. A mode decision scheme through the feedback channel
is studied. Compared to Wyner-Ziv video coding with INTRA coded key frames,
our approach can significantly improve the coding efficiency. We further consider the
scenario when there are transmission errors and delays over the backward channel. A
hybrid scheme with selective coding is proposed to address the problem. Our results
show that the coding performance can be improved by sending more motion vectors
to the encoder. However, there is a tradeoff between complexity and rate-distortion
performance in backward channel aware Wyner-Ziv video coding. We present a model
to quantitatively analyze the complexity and rate-distortion tradeoff for BP frames.
Three estimators, the minimum estimator, the median estimator and the average
estimator, are proposed and the complexity-rate-distortion analysis is presented.
1. INTRODUCTION
Digital images and videos are everywhere today with a wide range of applications,
such as high definition television, video delivery by mobile telephones and handheld
devices. Multimedia information is digitally represented so that it can be stored and
transmitted conveniently and accurately. However, digital image and video data gen-
erally require huge storage and transmission bandwidth. Even with the rapid increase
in processor speeds, disk storage capacity and broadband networks, an efficient rep-
resentation of the image and video signal is needed. Video compression algorithms
are used to reduce the data rate of the video signal while maintaining video quality.
A typical video coding system consists of an encoder and a decoder, which is referred
to as a codec [1]. To ensure the inter-operability between different platforms and
applications, image and video compression standards have been developed over the
years.
In this chapter, we first provide an overview of the image and video coding stan-
dards. In particular, we describe the current video coding standard H.264 in detail.
We then discuss on-going research within the video coding standard community, and
give an overview of the recent advances in video coding and their potential applica-
tions.
1.1 Overview of Image and Video Coding Standards
1.1.1 Image Coding Standards
JPEG (Joint Photographic Experts Group) [2–6] is a group established by mem-
bers from both the International Telecommunication Union (ITU) and the Interna-
tional Organization for Standardization (ISO) to work on image coding standards.
Fig. 1.1. Block Diagram of a DCT-Based JPEG Coder
The JPEG Standard
JPEG specifies a still image coding process and the file format of the bitstream.
An input image is first divided into non-overlapping blocks of size 8 × 8. Each block
is transformed into the frequency domain by the DCT, followed
by the quantization of the DCT coefficients and entropy coding. Fig. 1.1(a) shows
a block diagram of a Discrete Cosine Transform (DCT)-based JPEG encoder. The
process is repeated for the three color components of color images. The decoding
process applies the inverse operations of the encoding process in the reverse order.
Fig. 1.1(b) shows the block diagram of the DCT-based JPEG
decoder.
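The encoding path in Fig. 1.1(a) can be sketched in a few lines. This is a minimal illustration assuming an orthonormal DCT-II matrix and a single uniform quantizer step; the standard itself uses per-frequency quantization tables and Huffman or arithmetic entropy coding, which are omitted here.

```python
import numpy as np

N = 8  # JPEG block size

# Orthonormal DCT-II matrix: C[k, n] = s_k * cos(pi * (2n + 1) * k / (2N))
k = np.arange(N)[:, None]
n = np.arange(N)[None, :]
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] /= np.sqrt(2.0)  # DC row scaling makes the matrix orthonormal

def encode_block(block, step):
    """2-D DCT of an 8x8 block followed by uniform quantization."""
    coeffs = C @ block @ C.T           # separable row/column transform
    return np.rint(coeffs / step).astype(int)

def decode_block(q, step):
    """Dequantize and apply the inverse 2-D DCT."""
    return C.T @ (q * step) @ C

block = np.arange(64, dtype=float).reshape(8, 8)   # a smooth test block
rec = decode_block(encode_block(block, step=2.0), step=2.0)
print(np.max(np.abs(rec - block)))   # small: each coefficient error is at most step/2
```

Because the transform is orthonormal, the reconstruction error is controlled entirely by the quantizer step, which is the knob JPEG exposes through its quality setting.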
Fig. 1.2. Discrete Wavelet Transform of Image Tile Components
The JPEG 2000 Standard
JPEG 2000 [4, 7, 8] is a wavelet-based image compression standard. It provides
superior low data rate performance. JPEG 2000 also provides efficient scalability
such as rate scalability and resolution scalability, which allows a decoder to decode
and extract information from part of the compressed bit stream.
The main processing blocks of JPEG 2000 include transform, quantization and
entropy coding. First the input image is decomposed into components that are handled
separately. There are two possible choices: the YCbCr domain and the YUV
domain. The input component is then divided into rectangular non-overlapping tiles,
which are processed independently as shown in Fig. 1.2. The use of discrete wavelet
transform (DWT) instead of DCT as in JPEG is one of the major differences between
JPEG and JPEG 2000. The tiles can be transformed into different resolution levels
to provide Region-of-Interest (ROI) coding. Before entropy coding, the transform
coefficients are quantized within each subband. Arithmetic coding is used in JPEG
2000.
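For its reversible (lossless) path, JPEG 2000 uses the integer 5/3 wavelet. One level of its lifting form might look as follows; this is a simplified sketch assuming an even-length 1-D signal with mirrored boundaries (the standard applies the transform separably to the rows and columns of each tile):

```python
import numpy as np

def dwt53_forward(x):
    """One level of the integer 5/3 lifting transform (even-length 1-D signal)."""
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2], x[1::2]
    even_next = np.append(even[1:], even[-1])   # mirror the right boundary
    d = odd - (even + even_next) // 2           # predict step -> highpass subband
    d_prev = np.insert(d[:-1], 0, d[0])         # mirror the left boundary
    s = even + (d_prev + d + 2) // 4            # update step -> lowpass subband
    return s, d

def dwt53_inverse(s, d):
    """Invert the lifting steps in reverse order."""
    d_prev = np.insert(d[:-1], 0, d[0])
    even = s - (d_prev + d + 2) // 4
    even_next = np.append(even[1:], even[-1])
    odd = d + (even + even_next) // 2
    x = np.empty(2 * len(s), dtype=np.int64)
    x[0::2], x[1::2] = even, odd
    return x

x = np.array([10, 12, 14, 200, 202, 204, 16, 18])
s, d = dwt53_forward(x)
assert np.array_equal(dwt53_inverse(s, d), x)  # perfect integer reconstruction
```

Each lifting step only adds an integer function of samples the inverse also has available, so the transform is exactly reversible, which is what makes lossless wavelet coding possible.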
1.1.2 Video Coding Standards
Since the early 1990s, a series of video coding standards have been developed to
meet the growing requirements of video applications. Two groups have been actively
involved in the standardization activities: the Video Coding Experts Group (VCEG)
and the Moving Picture Experts Group (MPEG). VCEG is working under the direc-
tion of the ITU Telecommunication Standardization Sector (ITU-T), which is formerly
known as International Telegraph and Telephone Consultative Committee (CCITT).
This group typically works on the standard with the names “H.26x”. MPEG carries
out the standardization work under ISO/IEC and labels them as “MPEG-x”. In this
section, we briefly review the standards in chronological order and describe the
latest standard, H.264, in detail.
H.261
H.261 was approved in 1991 for video-conferencing systems and video-phone ser-
vices over the integrated services digital network (ISDN) [9] [10]. The target data
rates are multiples of 64 kbps. The H.261 standard has two main modes: the
INTRA and INTER modes. The INTRA mode is basically similar to the JPEG com-
pression (Section 1.1.1) where DCT-based block transform is used. For the INTER
mode, motion estimation (ME) and motion compensation (MC) were first adopted.
The motion search resolution adopted in H.261 is integer-pixel accuracy. For every
16 × 16 macroblock, one motion vector is chosen from a search window centered at
the original pixel position.
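This integer-pel search can be sketched as an exhaustive SAD (sum of absolute differences) minimization over the search window; real encoders restrict or accelerate the search, but the principle is the same:

```python
import numpy as np

def full_search(cur, ref, bx, by, bsize=16, radius=7):
    """Return the (dy, dx) minimizing SAD for the block at (bx, by).

    cur, ref: current and reference frames as 2-D arrays;
    the search covers +/- radius integer-pixel displacements.
    """
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# Content shifted by (2, 3) between frames is found at displacement (-2, -3),
# i.e. the motion vector points back to where the block came from.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(np.roll(ref, 2, axis=0), 3, axis=1)
print(full_search(cur, ref, bx=24, by=24))  # -> (-2, -3)
```

The cost of this exhaustive search is one SAD per candidate position, which is exactly the encoder-side burden that the distributed coding approaches discussed later in the thesis try to move to the decoder.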
MPEG-1
MPEG-1 was started in 1988 and finally approved in 1993. The main application
of MPEG-1 is for storage of video data on various digital storage media such as CD-
ROM. MPEG-1 added more features to H.261. The comparison of the H.261 and
Table 1.1
Comparison of the H.261 and MPEG-1 standards

H.261                                       MPEG-1
Sequential access                           Random access
One basic frame rate                        Flexible frame rate
CIF and QCIF images only                    Flexible image size
I and P frames                              I, P and B frames
MC over 1 frame                             MC over 1 or more frames
1 pixel MV accuracy                         1/2 pixel MV accuracy
Variable threshold + uniform quantization   Quantization matrix
Slice structure                             Uses groups of pictures (GOP)
MPEG-1 standards are shown in Table 1.1 [9]. A significant improvement is the
introduction of bi-directional prediction in MPEG-1, where both the previous and
next reconstructed frames can be used as reference frames.
MPEG-2/H.262
MPEG-2, also known as H.262, was developed as a joint effort between VCEG
and MPEG. MPEG-2 aims to serve a variety of applications, such as DVD video and
digital television broadcasting. The key features are:
• MPEG-2 accepts not only progressive video, but also interlaced video. It also
adds a new macroblock size of 8 × 16.
• MPEG-2 provides a scalable bitstream. The syntax allows more than one layer
of video. Three forms of scalability are available in MPEG-2: spatial scalability,
temporal scalability and SNR scalability.
• MPEG-2 features more options in the quantization and coding steps to further
improve the video quality.
H.263
H.263 [11], and its extensions, known as H.263+ [12] and H.263++, share many
similarities with H.261 but offer more coding options. It was originally developed for low
data rates but was eventually extended to arbitrary data rates. It is widely used in
video streaming applications. With the new coding tools, H.263 can achieve similar
video quality as H.261 with roughly half the data rate or lower. The motion search
resolution specified in H.263 is half-pixel accuracy. The quantized DCT coefficients
are coded with 3-D Variable Length Coding (last, run, level) instead of a 2-D Variable
Length Coding (run, level) and the end-of-block marker. In the advanced prediction
mode, there are two important options which result in significant coding gains:
• H.263 allows four motion vectors per macroblock and the one/four vectors de-
cision is indicated in the macroblock header. Despite using more bits on the
transmission of motion information, it gives more accurate prediction and hence
results in a smaller residual entropy.
• Overlapped block motion compensation (OBMC) is adopted to reduce the
blocking artifacts. The blocks are overlapped quadrant-wise with the neighboring
blocks. In H.263 Annex F, each pixel is predicted by a weighted sum
of three prediction values obtained from three motion vectors. OBMC pro-
vides improved prediction accuracy as well as better subjective quality in video
coding, with increased computational complexity.
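The overlapped-prediction idea can be illustrated with a toy 1-D sketch. The triangular weights and the `overlap` parameter below are illustrative choices for this sketch, not the normative Annex F weighting tables:

```python
import numpy as np

def obmc_1d(ref, mvs, bsize=8, overlap=2):
    """Toy 1-D OBMC: blend predictions from adjacent blocks' motion vectors.

    ref: 1-D reference signal; mvs: one integer displacement per block.
    """
    n = len(mvs) * bsize
    pred = np.zeros(n)
    weight = np.zeros(n)
    for b, mv in enumerate(mvs):
        # each block predicts a region extended by `overlap` pixels per side
        lo, hi = max(b * bsize - overlap, 0), min((b + 1) * bsize + overlap, n)
        idx = np.arange(lo, hi)
        src = np.clip(idx + mv, 0, len(ref) - 1)   # motion-compensated samples
        # triangular weight: full inside the block, ramping down in the overlap
        w = np.minimum(np.minimum(idx - lo + 1, hi - idx), overlap + 1)
        pred[idx] += w * ref[src]
        weight[idx] += w
    return pred / weight   # normalize so the blended weights sum to one

# With zero motion everywhere the blended prediction reproduces the reference.
ref = np.arange(32, dtype=float)
print(np.allclose(obmc_1d(ref, [0, 0, 0, 0]), ref))  # True
```

Near block boundaries each pixel is a weighted mix of the predictions implied by its own and its neighbor's motion vectors, which is what smooths the discontinuities that cause blocking artifacts.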
MPEG-4
The MPEG-4 standard [13] includes the specifications for Systems, Visual, and
Audio. It has been used in several areas, including digital television, interactive
graphics applications, and multimedia distribution over the networks. MPEG-4 en-
ables object-based video coding by coding the contents independently. A scene is
composed of several Video Objects (VOs). The information of shape, texture, shape
motion, and texture motion is extracted by image analysis and coded by parameter
coding. MPEG-4 also supports mesh, face and body animation. It provides
various coding tools for the scalability of contents. The Fine Granular Scalability
(FGS) Profile allows adaptation to bandwidth variation and resilience to packet
losses. MPEG-4 also incorporates several error-resilience tools [14, 15], which achieve
better resynchronization, error isolation and data recovery. NEWPRED mode is a
new error resilience tool adopted in MPEG-4. It allows the encoder to update the
reference frames adaptively. A feedback message from the decoder identifies the lost
or damaged segments and the encoder will avoid using them as further reference.
H.264
After the development of H.263 and MPEG-4, a longer-term video coding project,
known as H.26L, was set up. It further evolved into H.264, which was approved as a
joint coding standard of both ISO and ITU-T in 2004.
A video coding standard generally defines only the syntax and the decoding process, which
provides flexibility for encoder optimization. All these standards (Section 1.1.2) are
based on block-based hybrid video coding as shown in Fig. 1.3. More sophisticated
coding methods, such as highly accurate motion search and better texture models,
lead to advances in video compression.
In H.264, the input video frame is divided into slices, which are further divided
into macroblocks, and each macroblock is processed independently. The basic coding
unit of the encoder and decoder is a fixed-size macroblock which consists of 16 × 16
samples of the luma component and 8 × 8 samples of the chroma components for the
4:2:0 format. The macroblock can be further divided into small blocks. A sequence of
macroblocks is organized to form a slice in raster scan order, which represents a
region in the picture that can be decoded independently. For example, a picture may
contain three slices as shown in Fig. 1.4 [1] [16]. Each slice is self-contained, which
means it can be decoded without information from the other slices.
Fig. 1.3. A Hybrid Motion-Compensated-Prediction Based Video Coder (H.264)
Fig. 1.4. Subdivision of a Picture into Slices
The basic block diagram of the H.264 encoder is shown in Fig. 1.3-(a). First the
current block is predicted from the previously coded spatially or temporally neigh-
boring blocks. If INTER coding is selected, a motion vector is obtained to rebuild the
current block using motion compensation, where the two-dimensional motion vector
represents the displacement between the current block and its best matching reference
block. The prediction error is transformed by the DCT (or the integer transform in H.264)
to reduce the statistical correlation. The transform coefficients are quantized by a
predefined quantization table, where a quantization parameter controls the step size
of the quantizer. Generally one of the following two entropy coding methods is used
for the quantized coefficients: variable-length code (VLC) and arithmetic coding. The
in-loop deblocking filter is used to reduce “blocking” artifacts while maintaining the
sharpness of the edges across block boundaries. It enables the reduction in the data
rate and improves the subjective quality.
There are five slice types, where the first three are similar to the previous standards
and the last two are new [1] [16] [17]:
• I slice: All macroblocks in the slice use only INTRA prediction.
• P slice: The macroblocks in the slice use not only INTRA prediction, but also
INTER prediction (temporal prediction) with only one motion vector per block.
• B slice: Two motion vectors are available for every block in the slice. A
weighted average of the pixel values forms the motion-compensated prediction.
• SP slice and SI slice: Switching P slice and switching I slice provide exact
switching between different video streams, as well as random access, fast forward
and reverse. The difference between these two types is that SI slice uses only
INTRA prediction and SP slice uses INTER prediction as well.
The main difference between I slice and P/B slice is that temporal prediction is
not employed in I slice. Three sizes are available for INTRA prediction: 16 × 16,
8 × 8 and 4 × 4. Fig. 1.5 shows the prediction from different directions, where 16
Fig. 1.5. (a) INTRA 4 × 4 Prediction (b) Eight “Prediction Directions”
samples denoted by a through p are predicted by the neighboring samples denoted
by A through Q. Eight possible directions of prediction are illustrated excluding the
“DC” prediction mode. Fig. 1.6 lists five of the nine INTRA 4 × 4 modes corre-
sponding to the directions in Fig. 1.5. In mode 0 (vertical prediction), the samples
above the top row of the 4 × 4 block are copied directly as the predictor, which is
illustrated by the arrows. Similarly, mode 1 (horizontal prediction) uses the column
to the left of the 4 × 4 block as the predictor. For mode 2 (DC prediction), the
average of the adjacent samples is taken as the predictor. The other six modes
are the diagonal prediction modes. They are known as diagonal-down-left, diagonal-
down-right, vertical-right, horizontal-down, vertical-left, and horizontal-up prediction
respectively. In INTRA 16 × 16 mode, there are only four modes available: vertical,
horizontal, DC, and plane predictions. For the first three, they are similar to the cor-
responding modes in INTRA 4×4. The plane prediction uses the linear combinations
of the neighboring pixels as the predictor.
Fig. 1.6. Five of the Nine INTRA 4 × 4 Prediction Modes

As mentioned above, the previous standards use the DCT for transform coding. In
H.264, a separable integer transform is used. The integer transform is an approximation
of the DCT that uses integer arithmetic. The basic matrix is designed as:
T4×4 =
[  1    1    1    1 ]
[  2    1   −1   −2 ]
[  1   −1   −1    1 ]
[  1   −2    2   −1 ]   (1.1)
The matrix is very simple: only a few additions, subtractions, and bit shifts are
needed for implementation. Moreover, the encoder-decoder mismatches caused by
floating-point arithmetic in the DCT are avoided.
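The forward transform can be sketched in a few lines of Python (an illustrative sketch, not the normative H.264 implementation; the helper names are ours). Applying Y = T X Tᵀ with the matrix of Eq. (1.1) uses only integer arithmetic, so the encoder and the decoder obtain bit-identical results:

```python
# Illustrative sketch of the 4x4 integer transform of Eq. (1.1),
# computed as Y = T * X * T^T with integer arithmetic only.
# (Helper names are ours; the scaling stage of H.264 is omitted.)

T = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    """Integer product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_transform(block):
    """Core 4x4 forward transform Y = T X T^T."""
    return matmul(matmul(T, block), transpose(T))

# A constant (flat) block concentrates all its energy in the DC position.
flat = [[10] * 4 for _ in range(4)]
coeffs = forward_transform(flat)
```

For the flat block the DC coefficient is 16 × 10 = 160 and every other coefficient is zero, illustrating the energy-compaction property the transform is designed for.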
Two entropy coding methods are supported in H.264 in a context adaptive way:
context-adaptive variable-length coding (CAVLC) and context-adaptive binary arith-
metic coding (CABAC). CABAC achieves higher coding efficiency at the cost of higher
complexity. It allows a non-integer number of bits per symbol, while CAVLC supports only
an integer number of bits per symbol. In general, CABAC uses 10% − 15% fewer bits
than CAVLC. In CAVLC, the symbols that occur more frequently are assigned shorter
codes and vice versa. In CABAC, context modeling is first used to enable the choice
of a suitable model for each syntax element. For example, transform coefficients
and motion vectors belong to different models. Similar to a Variable Length Coding
(VLC) table, a specified binary tree structure is then used to support the binarization
process. Finally a context-conditional probability estimation is employed.
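The advantage of a non-integer number of bits per symbol can be illustrated with a small calculation (the symbol probabilities below are made up for illustration): for a skewed distribution, the entropy that an arithmetic coder such as CABAC can approach lies below the best rate achievable with integer-length codewords, which is the regime a VLC is limited to.

```python
import math

# Illustrative calculation (made-up probabilities): entropy vs. the best
# integer-bits-per-symbol prefix code for a skewed three-symbol source.

probs = {"a": 0.9, "b": 0.05, "c": 0.05}

# Entropy in bits/symbol: the rate an arithmetic coder can approach.
entropy = -sum(p * math.log2(p) for p in probs.values())

# A prefix code must spend an integer number of bits per symbol; the best
# such code here is {a: "0", b: "10", c: "11"} (a Huffman code).
vlc_lengths = {"a": 1, "b": 2, "c": 2}
vlc_rate = sum(probs[s] * vlc_lengths[s] for s in probs)
```

Here the entropy is roughly 0.57 bits per symbol while the prefix code spends 1.1, showing why arithmetic coding wins most when the symbol statistics are highly skewed.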
In P slices, temporal prediction is used and motion vectors are estimated between
pictures. Compared to the previous standards, H.264 allows more partition sizes that
include 16 × 16, 16 × 8, 8 × 16, or 8 × 8. When the 8 × 8 macroblock partition is chosen,
additional information is used to specify whether it is further partitioned to 8 × 4,
4 × 8, or 4 × 4 sub-macroblocks. The possible partitioning results are shown in Fig.
1.7, where the index of the block shows the order of the coding process. For example,
as shown in Fig. 1.8, the macroblock is partitioned into four blocks of size 8 × 16,
8 × 8, 4 × 8, and 4 × 8 respectively. Each block is assigned one motion vector and
four motion vectors are transmitted for this P macroblock.
Fig. 1.7. Segmentation of the Macroblock for Motion Compensation
Fig. 1.8. An Example of the Segmentation of One Macroblock
As in the previous standards, H.264 supports fractional accuracy motion vectors.
If the motion vector points to integer samples, the prediction value is the correspond-
ing pixel value of the reference picture. If the motion vector points to fractional posi-
tions, the prediction values at half-sample positions are obtained by a one-dimensional
6-tap Finite Impulse Response (FIR) filter horizontally and vertically. The prediction
values at quarter-sample positions are obtained by the average of the pixel values at
the integer and half-sample positions. Fig. 1.9 shows the positions of the pixels,
where gray pixels are integer-sample positions. To obtain the half-sample pixels b
and h, two intermediate values b1 and h1 are derived using a 6-tap filter:
b1 = E − 5F + 20G + 20H − 5I + J (1.2)
h1 = A − 5C + 20G + 20M − 5R + T (1.3)
Then the values are rounded, shifted, and clipped to the range 0 − 255:
b = (b1 + 16) >> 5 (1.4)
h = (h1 + 16) >> 5 (1.5)
The half-sample pixels m and s are obtained in a similar way. The center half-sample
pixel j is derived by:
j = (cc − 5dd + 20h1 + 20m1 − 5ee + ff + 512) >> 10 (1.6)
where the intermediate values cc, dd, m1, ee, and ff are derived similarly to b1. The
samples at quarter-sample positions are obtained by averaging the two nearest
samples at integer- and half-sample positions, for example:
a = (G + b + 1) >> 1 (1.7)
e = (b + h + 1) >> 1 (1.8)
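Eqs. (1.2), (1.4), and (1.7) can be sketched directly in Python (an illustrative sketch, not the normative decoding process; the helper name `half_sample` is ours):

```python
# Illustrative sketch of half-sample interpolation, Eqs. (1.2) and (1.4):
# a 6-tap FIR filter (1, -5, 20, 20, -5, 1), rounding, a shift by 5
# (division by 32), and clipping to the 8-bit range.
# (The helper name is ours, not from the standard.)

def half_sample(p0, p1, p2, p3, p4, p5):
    """Interpolate the half-sample position between p2 and p3."""
    intermediate = p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5
    value = (intermediate + 16) >> 5          # round and divide by 32
    return max(0, min(255, value))            # clip to 0-255

# For a flat row of samples the filter reproduces the sample value
# (the taps sum to 32, which the shift by 5 divides out).
b = half_sample(100, 100, 100, 100, 100, 100)

# Quarter-sample position, Eq. (1.7): average of G and b with rounding.
a = (100 + b + 1) >> 1
```

Note that the filter can overshoot before clipping: near a sharp edge the intermediate value may exceed the 8-bit range, which is why the final clip to 0 − 255 is needed.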
In H.264, multiple pictures are available in the reference buffer [16] as shown in
Fig. 1.10. The number of reference buffers is specified in the header. The weighted
prediction using multiple previously decoded pictures significantly outperforms pre-
diction with only one previously decoded picture [18].
Fig. 1.9. Filtering for Fractional-Sample Accurate Motion Compensation
Fig. 1.10. Multiframe Motion Compensation
The usage of B slices significantly improves the coding efficiency. A main feature
of a B slice is that two motion vectors are available for motion compensation.
Therefore, a B slice maintains two distinct lists of reference pictures, List 0 and List
1. The standard provides four modes for the prediction of B slices: List 0, List 1,
bi-predictive, and direct modes. In the List 0 or List 1 modes, the reference blocks are
derived only from List 0 or List 1, respectively. In the bi-predictive mode, a weighted combination
of the reference blocks from List 0 and List 1 is formed as the predictor. In the
direct mode, the motion vector is not derived by motion search but by scaling the
available motion vectors of the co-located macroblock in another reference picture.
As a new feature, H.264 allows a B slice to serve as a reference for subsequent pictures.
In summary, H.264 incorporates a collection of the state-of-the-art video coding
tools. Here are some important improvements compared to the previous standards
[16]:
• improved motion-prediction techniques;
• adoption of a small block-size exact-match transform;
• application of adaptive in-loop deblocking filter;
• employment of the advanced entropy coding methods.
These new features could reduce the data rate by approximately 50% with similar per-
ceptual quality in comparison to H.263 and MPEG-4 [1]. The enhanced performance
of H.264 presents a promising future for video applications.
1.2 Recent Advances in Video Compression
With the success of the various video coding standards based on hybrid motion
compensated methods, the research community has also been investigating new tech-
niques that can address next-generation video services [5, 19]. These techniques pro-
vide higher robustness, interactivity and scalability. For instance, spatial and tem-
poral models for texture analysis and synthesis have been developed to increase the
coding efficiency for video sequences containing textures [20]. Context-adaptive al-
gorithms for intra prediction and motion estimation are proposed in [21] [22]. An
algorithm to identify regions of interest (ROIs) in home video is proposed in [23]. A
low bit-rate video coding approach is presented in [24] which uses modified adaptive
warping and long-term spatial memory. This section describes some recent advances
in video coding. Distributed video coding (DVC) is a new approach that reduces the
complexity of the encoder and has potential applications in video surveillance and
error resilience. Scalable video coding provides flexible adaptation to network condi-
tions. We also give an overview on multi-view and 3D video compression techniques.
1.2.1 Distributed Video Coding
Motion-compensated-prediction (MCP) based video coding systems are highly
asymmetrical since the computationally intensive motion prediction is performed in
the encoder. Generally, the encoder is approximately 5 − 10 times more complex
than the decoder [25]. This suits applications such as video streaming,
broadcast systems, and Digital Versatile Disc (DVD), where power constraints are
less of a concern at the encoder. However, new applications with limited access to
power, memory, and computational resources at the encoder have difficulty using
conventional video coding systems. For example, in wireless sensor networks, the
nodes typically cannot be recharged during a mission. Therefore, a simple
encoder with low complexity is needed.
Distributed coding is one method to achieve low complexity at the encoder. In
distributed coding, source statistics are exploited at the decoder and the encoder can
be simplified. The theoretical basis for the problem dates back to two theorems in
the 1970s. Slepian and Wolf proved a theorem to address lossless compression [26].
Wyner and Ziv extended the results to the lossy compression case [27]. Therefore,
low complexity distributed video encoding approaches are sometimes also referred to
as Wyner-Ziv video coding. The two theorems are described in detail in Section 2.1.1
and Section 2.1.2 respectively.
Since these theoretic results were revisited in the late 1990s, several methods
have been developed to achieve the results predicted in these two theorems [28–
37]. They are generally based on channel coding techniques, for example turbo
codes [38–41] and low-density-parity-check (LDPC) codes [42–45]. Generation of side
information at the decoder is essential to the performance of the system and various
ways to improve it are studied in the literature. A hierarchical frame dependency
structure was presented in [29] where the interpolation of the side information is done
progressively. A weighted vector median filter of the neighboring motion vectors is
used to eliminate the motion outliers in [46, 47]. Side information generated by a
3D face model is combined with block-based generated side information to improve
the quality of the side information [48]. A refined side estimator that iteratively
improves the quality of the side information has been proposed by different research
groups [35, 49–51]. Even though the distributed video coding approaches are still
far from mature, the research community has identified several potential applications
that range from wireless cameras, surveillance systems, distributed streaming and
video conferencing to multiview image acquisition and medical image processing [52].
In this section we give an introduction to the basic codec designs for distributed video
coding as well as some applications.
Distributed Source Coding Using Syndromes (DISCUS)
Distributed Source Coding Using Syndromes (DISCUS) [53] is one of the pioneering
works in distributed source coding. Before describing the design of DISCUS
[53], it is helpful to introduce an example that illustrates the basic concept.
Suppose X and Y are correlated binary words. The encoder has only the information
Fig. 1.11. Side Information in DISCUS
of X and the side information Y is available at the decoder as shown in Fig. 1.11.
The problem is formulated as:
X, Y ∈ {0, 1}^n,  dH(X, Y ) ≤ t   (1.9)
where dH(X,Y ) represents the Hamming distance between codewords X and Y . The
encoder sends the index of the coset to which X belongs. The binary linear code is
appropriately chosen with parameters (n, k), where n is the output code length and k
is the input message length. The decoder estimates X from the index of the coset and the
side information Y . For example, suppose X and Y are 3-bit binary words with equal
probability, i.e., P(X = x) = 1/8. If we have no information about Y , 3 bits are required to
encode X. Now the decoder observes Y and has the prior knowledge of the correlation
condition that the Hamming distance between X and Y is not greater than one. Four
cosets are established such that the Hamming distance between the two codewords
in each coset is 3, i.e., the four cosets are {000, 111}, {001, 110}, {010, 101},
and {011, 100}. A 2-bit code is sufficient to send the index of the coset. Assume
the index received at the decoder is the third one, i.e., X may be 010 or 101. Since
the decoder observes Y = 011, X must be 010 because of the correlation condition.
With side information, the length of the codeword is reduced to 2 bits. The results can
be generalized to the continuous-valued source X and side information Y .
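The 3-bit example above can be written out in a short Python sketch (illustrative; the function names are ours):

```python
# Illustrative sketch of the 3-bit DISCUS example: X is encoded by the
# 2-bit index of its coset, and the decoder recovers X using the side
# information Y and the constraint dH(X, Y) <= 1.
# (Function names are ours.)

COSETS = [("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def encode(x):
    """Send only the 2-bit coset index instead of the 3-bit word."""
    return next(i for i, coset in enumerate(COSETS) if x in coset)

def decode(index, y):
    """Pick the unique coset member within Hamming distance 1 of Y."""
    return next(w for w in COSETS[index] if hamming(w, y) <= 1)

recovered = decode(encode("010"), "011")   # recovers "010"
```

Because the two members of each coset differ in all three bit positions, at most one of them can be within Hamming distance 1 of Y, which is what makes the decoding unambiguous.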
PRISM
PRISM (Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia
coding) is a distributed video coding architecture proposed in [34,54,55]. It addresses
the problem of drift due to prediction mismatch. For video coding, the current mac-
roblock is regarded as X in the distributed source coding and the best predictor for
X from the previous frame is regarded as side information Y .
The block diagrams of the PRISM encoder and decoder are shown in Fig. 1.12 and
Fig. 1.13 respectively. The input video source is first partitioned into 16×16 or 8×8
blocks and blockwise DCT transformed. The DCT coefficients are arranged using
the zigzag scan. Since the first few coefficients carry most of the important infor-
mation about the frames, only about 20% of the DCT coefficients are encoded using
syndrome codes. Simple trellis codes are chosen for small block-lengths. Refinement
quantization is used after the syndrome coding. The rest of the DCT coefficients are
coded as conventional INTRA frames. Hence, no motion search is performed at the
encoder which ensures the simplicity of the encoder. At the decoder, motion search
is used to generate side information for the syndrome decoding. The decoder uses
the Viterbi algorithm for decoding.
With the encoder and decoder structures as shown in Fig. 1.12 and 1.13, PRISM
has the following features:
• low encoding complexity and high decoding complexity: The complexity of the
encoder is comparable to that of motion JPEG. The complexity is shifted to the
decoder due to the motion search operations.
• robustness: Since the frames are encoded in a self-contained manner, there is
no drift problem. The use of channel codes also improves the inherent robust-
ness. The step size of the base quantizer can be continuously tuned to achieve
a specific target data rate.
Fig. 1.12. Block Diagram of the PRISM Encoder
Fig. 1.13. Block Diagram of the PRISM Decoder
Wyner-Ziv Video Coding for Error-resilient Compression
A problem with motion-compensated prediction (MCP) based video coders is the
predictive mismatch between the reference frames at the encoder and the decoder
when there is an error during transmission. We also refer to this scenario as the drift
problem. Drift errors propagate through the subsequent frames until an INTRA
frame is sent, leading to significant quality degradation. In [56], an error resilient
method is proposed using Wyner-Ziv video coding by periodically transmitting a
small amount of additional information to prevent the propagation of errors.
The problem of predictive coding is re-formulated as a variant of the Wyner-Ziv
problem in [56, 57]. Denote two successive symbols to be encoded as x_{n−1} and x_n.
Assume x'_{n−1} is the reconstruction of x_{n−1} at the decoder, which is possibly erroneous.
The problem is formulated as coding the symbol x_n with the side information x'_{n−1}.
The additional information sent by the encoder is termed “coset information,”
and a frame including coset information is denoted a “peg frame.” The encoder
sends both the residual frame and the coset index for the “peg frame.” The
coset information is generated as follows. First, a forward transform
is applied to each 4 × 4 block. Then the transform coefficients are quantized.
LDPC (low-density parity-check) encoding is done on each bit plane of the
transform-domain coefficients. Forward error correction (FEC) is used to protect the
peg information. All the non-peg frames are coded by H.26L. The proposed approach
demonstrates the capability of error correcting in the event of channel loss [56,57].
Systematic Lossy Error Protection
In systematic lossy source-channel coding [58], a digital channel is added as
an enhancement layer in addition to the original analog channel. The Wyner-Ziv
paradigm is employed as forward error protection for systematic lossy source-channel
coding [30,59–61]. The system diagram shown in Fig. 1.14 is referred to as systematic
lossy error protection (SLEP). The main video coder is regarded as the systematic
Fig. 1.14. Systematic Lossy Error Protection (SLEP) by Combining Hybrid Video Coding and RS Codes
part of the system. If there is no transmission error, the Wyner-Ziv bitstream is
redundant. When the signal transmitted from the main video encoder encounters
channel errors, the Wyner-Ziv coder provides the augmentation information. In this
scheme, video encoder A codes the input video S with a coarser Quantization
Parameter (QP). The bitstream is sent through a Reed-Solomon (RS) coder, and
only the parity bits of the RS coder are transmitted to the decoder. The reconstructed
mainstream signal S′ at the decoder is coded by a video encoder B that is identical
to video encoder A. The output serves as the side information for the decoder of
the RS bitstream. After video decoder C, the reconstructed video signal replaces
the corrupted slices with the coarser version. Hence the system prevents large errors
due to the error-prone channel. The proposal outperforms traditional forward error
correction (FEC) when the symbol error rate is high. An extension of this approach
is presented in [61] where unequal error protection for the bitstream is used to further
improve error resilience performance. The significant data elements, such as motion
vectors, are provided with more parity bits than low priority data elements, such as
transform coefficients.
Fig. 1.15. Block Diagram of the Layered Wyner-Ziv Video Codec
Layered Wyner-Ziv Video Coding
A layered Wyner-Ziv video codec was proposed in [62, 63] which also achieves
scalability. It can encode the source once and decode the same bitstream at different
lower rates. The proposed system accommodates unpredictable variations
in bandwidth. Fig. 1.15 shows the block diagram of the layered Wyner-Ziv video
codec. The H.26L video coder is treated as the base layer and the bitstream from the
Wyner-Ziv video coder is considered as the enhancement layer. Three components
are included in the Wyner-Ziv encoder: DCT transform, nested scalar quantization
(NSQ), and Slepian-Wolf coding (SWC). NSQ partitions the DCT coefficients into
cosets and outputs the indices of the cosets. A multi-level LDPC code is employed to
design SWC. Every bit plane is associated with a portion of the coefficients, where
the most significant bit plane is assigned as the first bit plane. Each extra bit plane
is regarded as an additional enhancement layer. The decoder recovers the bit planes
sequentially starting with the first plane. Every extra bit plane that the decoder
receives improves the decoding quality.
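The NSQ idea can be sketched in Python (an illustrative sketch with a made-up step size and coset count, not the codec of [62, 63]): the encoder transmits only the coset index of the quantization index, and the decoder resolves the ambiguity using side information that is close to the source.

```python
# Illustrative sketch of nested scalar quantization (NSQ); step size and
# coset count are assumed values, and the function names are ours.

STEP = 4      # quantizer step size (assumed)
M = 8         # number of cosets (assumed), so log2(M) = 3 bits are sent

def nsq_encode(x):
    q = round(x / STEP)          # scalar quantization index
    return q % M                 # transmit only the coset index

def nsq_decode(coset, y, search=64):
    # Among all indices sharing this coset index, choose the one whose
    # reconstruction is nearest to the side information y.
    candidates = [q for q in range(-search, search) if q % M == coset]
    q_hat = min(candidates, key=lambda q: abs(q * STEP - y))
    return q_hat * STEP

x, y = 50.0, 48.0                # correlated source and side information
x_hat = nsq_decode(nsq_encode(x), y)
```

The decoding is correct whenever the side information is closer to the source than half the spacing M × STEP between coset members, which is the correlation assumption the coset spacing must be matched to.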
1.2.2 Scalable Video Coding
Scalable video coding provides flexible adaptation to heterogeneous network con-
ditions. The source sequence is encoded once and the bitstream can be decoded
partially or completely to achieve different quality. The base layer provides the basic
information of the sequence and each enhancement layer can be added to improve the
quality incrementally.
Research on scalable video coding has been going on for about twenty years [64].
The rate-distortion analysis of scalable video coding is extensively studied in [65–68].
Because of the inherent scalability of the wavelet transform, wavelet-based video
coding structures have been explored. A fully rate scalable video codec, known as
SAMCoW (Scalable Adaptive Motion Compensated Wavelet) was proposed in [69].
A 3-D wavelet transform with MCTF (Motion Compensated Temporal Filtering) [70]
has been developed. On the other hand, a design based on hybrid video coding is
standardized. MPEG-2 was the first video coding standard to introduce the concept
of scalability [70]. The scalable extension of H.264 [64] is a current standardization
effort supporting temporal, spatial, and SNR scalability. Compared to the state-of-
the-art single layer H.264, the scalable extension has a small increase in the decoding
complexity. Spatial and SNR scalability generally have a negative impact on the
coding performance depending on the encoding parameters. In the following, we give
an overview of the concept of temporal, spatial, and SNR scalability respectively.
Temporal Scalability
Temporal scalability allows decoding at several frame rates from one bitstream.
As shown in Fig. 1.16, temporal scalable video coding can be generated by a hier-
archical structure. The first row index Ti (i = 0, 1, 2, 3) represents the index of the
layers, where T0 is the base layer and Ti (i = 1, 2, 3) are the enhancement layers. The
second row index denotes the coding order, where the frames of a lower layer are
coded before the neighboring frames of higher layers. If only the T0 layer is decoded,
the sequence plays at 7.5 frames per second; adding the T1 layer produces a
sequence with 15 frames per second. The frame rate can be further increased to 30
frames per second by also decoding the T2 layer. The hierarchical structure shows improved
coding efficiency, especially with cascading quantization parameters. The base layer
is encoded with high fidelity, followed by enhancement layers coded at lower quality. Even
though the enhancement layers are generally coded as B frames, they can be coded
as P frames to reduce the delay at a cost in coding efficiency.
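The dyadic layer structure of Fig. 1.16 can be summarized in a small Python sketch (illustrative; it assumes a four-layer hierarchy over GOPs of eight frames and a full rate of 60 frames per second, and the function names are ours):

```python
# Illustrative sketch of dyadic temporal scalability: with hierarchical
# layers T0..T3 over GOPs of 8 frames, each added layer doubles the
# decodable frame rate (7.5 -> 15 -> 30 fps in the example above),
# assuming a full rate of 60 fps when all four layers are decoded.

def temporal_layer(frame_index, num_layers=4):
    """Layer of a frame in a dyadic hierarchy: T0 holds every 8th frame."""
    period = 1 << (num_layers - 1)            # 8 for four layers
    for layer in range(num_layers):
        if frame_index % (period >> layer) == 0:
            return layer
    return num_layers - 1

def frame_rate(base_fps, layers_decoded):
    """Decoding k layers yields base_fps * 2**(k - 1)."""
    return base_fps * (2 ** (layers_decoded - 1))
```

For frames 0 through 8 the layer assignment reproduces the pattern T0, T3, T2, T3, T1, T3, T2, T3, T0 of Fig. 1.16, and dropping a layer simply removes every other remaining frame.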
Fig. 1.16. Hierarchical Structure of Temporal Scalability
Spatial Scalability
In spatial scalability, each layer corresponds to a specific resolution. Besides the
prediction for single-layer coding, spatial scalable video coding also exploits the inter-
layer correlation to achieve higher coding efficiency. The inter-layer prediction can
use the information from the lower layers as the reference. This ensures that a set
of layers can be decoded independently of all higher layers. This restriction on the
prediction results in lower coding performance than single-layer coding at the highest
resolution.
The inter-layer correlation is exploited in several aspects [71]. Inter-layer intra
texture prediction uses the interpolation of the lower layer as the prediction of the
current macroblock. The borders of the block from the lower layer are extended
before applying the interpolation filter. Two modes are available in inter-layer motion
prediction: the base layer mode and the quarter pel refinement mode. For the base
layer mode, no additional motion header is transmitted. The macroblock partitioning,
the reference picture indices, and the motion vectors of the lower layer are copied or
scaled to be used in the current layer. For the quarter pel refinement mode, a quarter-
pixel motion vector refinement is transmitted for a refined motion vector. A flag is
sent to signal the use of inter-layer residual prediction, where the residual signal of
the lower layer is upsampled as the prediction and only the difference is transmitted.
SNR Scalability
Two concepts are used in the design of SNR scalable coding: coarse-grain scala-
bility (CGS) and fine-grain scalability (FGS) [64,70,72,73]. In CGS, SNR scalability
is achieved by using similar inter-layer prediction techniques as described in Section
1.2.2 without the interpolation/upsampling. It can be regarded as a special case of
spatial scalability which has the identical frame resolution through the layers. CGS
is characterized by its simplicity in design and low decoder complexity. However, it
lacks flexibility in the sense that no fine tuning of the SNR points is achieved. The
number of the SNR points is fixed to the number of layers.
FGS coding allows truncating and decoding a bitstream at any point with
bit-plane coding. Progressive refinement (PR) slices are used in FGS to achieve full
SNR scalability over a wide range of rate-distortion points. The transform coefficients
are encoded successively in PR slices by requantization and a modified entropy coding
process.
1.2.3 Multi-View Video Coding
3D video (3DV) is an extension of two-dimensional video to give the viewer the
impression of depth. ISO/IEC has specified a language to represent 3D graphic
data [74], referred to as virtual reality modeling language (VRML). Later, a language
known as BInary Format for Scenes (BIFS) was introduced as an extension of VRML
Fig. 1.17. Uli Sequences (Cameras 0, 2, 4)
[74]. Free viewpoint video (FVV) provides the viewers an interactive environment
with realistic impressions. The viewers are allowed to choose the view positions and
view directions freely. 3DV and FVV overlap in many applications and
they can be combined into a single system. The applications span entertainment,
education, sightseeing, surveillance, archiving, and broadcasting [75]. Generally multi-
view video sequences are captured simultaneously by multiple cameras. Fig. 1.17
shows an example provided by HHI [76]. The complete test sequences consist of
eight video sequences captured by eight cameras with 20cm spacing using 1D/parallel
projection. Fig. 1.17 shows the first frames of three sequences taken by Camera 0, 2
and 4.
3DV and FVV representations require the transmission of a huge amount of data.
Multi-view video coding (MVC) [77] addresses the problem of jointly compressing
multiple video sequences. Besides the spatial and temporal correlations present in a single-
view video sequence, multi-view video coding also exploits the inter-view correlations
between the adjacent views. As shown in Fig. 1.17, the adjacent sequences recorded
by dense camera settings have a high statistical dependency. Even though temporal
prediction modes are chosen with a high percentage, inter-view prediction is more
suitable for low frame rates and fast motion sequences [76]. After the call for proposals
by MPEG, many MVC techniques have been proposed. A multi-view video coding
scheme based on H.264 has been presented in [76]. The bitstream is designed in
compliance with the H.264 standard and it has shown a significant improvement in
coding efficiency over simulcast anchors. It was chosen as the reference solution in
MPEG to build the Joint Multiview Video Model (JMVM) software [78]. An inter-
view direct mode is proposed in [79] to save the bits of coding the motion vectors.
A view synthesis method is discussed in [80] to produce a virtual synthesis view
by the depth map. A novel scalable wavelet based MVC framework is introduced
in [81]. Based on the idea of distributed video coding as discussed in Section 1.2.1,
a distributed multi-view video coding framework is proposed in [82] to reduce the
encoder’s computational complexity and the inter-camera communication.
Joint Multiview Video Model (JMVM) [78] is reference software for MVC devel-
oped by the Joint Video Team (JVT). JMVM adopted the coding method presented
in [76]. The method is based on H.264 with a hierarchical B structure as shown in
Fig. 1.18. The horizontal index Ti denotes the temporal index and the vertical index
Ci denotes the index of the camera. The N video sequences from the N cameras
are rearranged to a single source signal. The spatial, temporal, and inter-view re-
dundancies are removed to generate a standard-compliant compressed bitstream. At
the decoder, the single bitstream is decoded and split to N reconstructed sequences.
For each separate view, a hierarchical B structure as described in Section 1.2.2 is
used. Inter-view prediction is applied to every 2nd view, such as the view taken by
the C1 camera. Each group of pictures (GOP) contains N times the length of the
GOP of a single view. In the example, the length of the GOP for each view is eight
and the Uli sequences have eight views, which results in a total GOP of 8 × 8 = 64
frames. The order of coding is arranged to minimize the memory requirements. How-
ever, the decoding of a higher layer frame still needs several references. For example,
the decoding of B3 frame at (C0, T1) needs four references, namely, the frames at
(C0, T0), (C0, T8), (C0, T4), and (C0, T2). The decoding of B4 frame at (C1, T1) needs
fourteen references decoded beforehand. From the experimental results, the standard
compliant scheme demonstrates a high coding efficiency with a reasonable encoder
complexity and memory requirement.
Fig. 1.18. Inter-View/Temporal Prediction Structure
1.3 Overview of The Thesis
1.3.1 Contributions of The Thesis
In this thesis, we study new approaches for low complexity video coding [37,
49,83–93]. The main contributions of this thesis are:
• Wyner-Ziv Video Codec Architecture Testbed
We studied the Wyner-Ziv video codec architecture and built a Wyner-Ziv video
coding testbed. The video sequences are divided into two different parts using
different coding schemes. Part of the sequence is coded by a conventional INTRA
coding scheme and referred to as key frames. The other part of the sequence is
coded as Wyner-Ziv frames using channel coding methods. Both turbo codes
and low-density parity-check (LDPC) codes are supported in the system. The
sequences can be coded either in the pixel domain or the transform domain. Only
part of the parity bits from the channel coders is transmitted to the de-
coder. Hence, the decoding of the Wyner-Ziv frames needs side information
at the decoder. The side information can be extracted from the key frames
by extrapolation or interpolation. We study various methods for side informa-
tion generation. In addition, we also analyze the rate-distortion performance of
Wyner-Ziv video coding compared with conventional INTRA or INTER coding.
• Backward Channel Aware Wyner-Ziv Video Coding
In Wyner-Ziv video coding, many key frames have to be INTRA coded to keep
the complexity at the encoder low and provide side information for the Wyner-
Ziv frames. However, the use of INTRA frames limits the coding performance.
We propose a new Wyner-Ziv video coding method that uses backward channel
aware motion estimation to code the key frames. The main idea is to perform
motion estimation at the decoder and send the estimated motion vectors back to
the encoder. In this way, we keep the complexity low at the encoder with
minimal usage of the backward channel. Our experimental results show that
the scheme can significantly improve the coding efficiency, by 1-2 dB, compared
with Wyner-Ziv video coding with INTRA-coded key frames. We also propose
the use of multiple motion decision modes, which can further improve the coding
efficiency by 0.5-2 dB. When the backward channel is subject to erasure
errors or delays, the coding efficiency of our method decreases. We provide an
error resilience technique to handle this situation; it reduces the quality
degradation due to channel errors and incurs only a small coding efficiency
penalty when the channel is free of errors.
• Complexity-Rate-Distortion Analysis of Backward Channel Aware
Wyner-Ziv Video Coding
We further present a model to study the complexity-rate-distortion tradeoff of
backward channel aware Wyner-Ziv video coding. We present three motion
estimators: the minimum motion estimator, the median motion estimator, and the
average motion estimator. Suppose we have several motion vectors derived at the
decoder. The minimum motion estimator sends all the motion vectors derived at
the decoder to the encoder, and the encoder chooses the motion
vector with the smallest distortion. When the backward channel bandwidth
cannot meet the requirement or the encoder complexity becomes a concern, the
median or average motion estimator chooses one motion vector at the decoder.
The results show that the rate-distortion performance of the average estimator is
generally higher than that of the median estimator. If the rate-distortion
tradeoff is the only concern, the minimum estimator yields better results than the
other two estimators. However, for applications with complexity constraints,
our analysis shows that the average estimator could be a better choice. Our
proposed model quantitatively describes the complexity-rate-distortion tradeoff
among these estimators.
1.3.2 Organization of the Thesis
The primary objective of this thesis is to analyze the performance of low com-
plexity video encoding, and to design new techniques to achieve a high video coding
efficiency while maintaining low encoding complexity.
In Chapter 2, we first provide an overview of the theoretical background of Wyner-
Ziv coding. Then we describe the Wyner-Ziv video coding testbed we developed,
followed by a rate-distortion analysis of the motion side estimator and a
discussion of coding with universal prediction.
In Chapter 3, we propose a new backward channel aware Wyner-Ziv approach.
The basic idea is to use backward channel aware motion estimation to code the key
frames, where motion estimation is done at the decoder and motion vectors are sent
back to the encoder. We refer to these backward predictive coded frames as BP
frames. Error resilience in the backward channel is also addressed by adaptive coding
of the key frames.
A model to describe the complexity and rate-distortion tradeoff for BP frames
is presented in Chapter 4. Three estimators, the minimum estimator, the median
estimator, and the average estimator, are proposed, and a complexity-rate-distortion
analysis is presented.
Chapter 5 concludes the thesis.
2. WYNER-ZIV VIDEO CODING
Wyner-Ziv coding is a new coding scheme based on two theorems presented in the 1970s.
In this coding scenario, source statistics are exploited at the decoder so that it is
feasible to design a simplified encoder. Several practical designs of Wyner-Ziv video
codecs are based on channel coding methods. In these systems, the complexity that
motion estimation imposes on the encoder in hybrid video coding is shifted to the
decoder. Specifically, the derivation of side information at the decoder involves
high-complexity motion estimation to ensure high-quality side information.
In this chapter, we describe the general Wyner-Ziv video coding (WZVC) architec-
ture. Section 2.1 gives an overview of the theoretical background behind Wyner-Ziv
coding. Section 2.2 describes the overall structure of Wyner-Ziv video coding and
the processing units in the system. Section 2.3 presents the rate-distortion analysis
of the motion side estimator. Section 2.4 presents Wyner-Ziv video coding with
universal prediction, which achieves low complexity at both the encoder and the decoder.
2.1 Theoretical Background
Two theorems presented in the 1970s [26, 27] play key roles in the theoretical
foundation of distributed source coding. The Slepian-Wolf theorem [26] proved that
a lossless encoding scheme without side information at the encoder may perform
as well as an encoding scheme with side information at the encoder. Wyner and
Ziv [27] extended the result to establish rate-distortion bounds for lossy compression.
For a long time there was little progress on constructive schemes, due to the lack of
a practical channel coding method [94] that achieves the bound.
Information-theoretic duality between source coding with side information (SCSI)
at the decoder and channel coding with side information (CCSI) at the encoder is
Fig. 2.1. Correlation Source Coding Diagram
discussed in [95]. The second scenario was first studied by Costa in the "dirty paper"
problem [96] and has been revisited recently because of its extensive applications in
data hiding, watermarking, and multi-antenna communications. In this section we
describe the Slepian-Wolf theorem and the Wyner-Ziv theorem in detail.
2.1.1 Slepian-Wolf Coding Theorem for Lossless Compression
Suppose two sources X and Y are encoded as shown in Fig. 2.1. When they are
jointly encoded, i.e., the switch between the encoders is on, the admissible rate bound
for the error-free case is:
RX,Y ≥ H(X,Y ) (2.1)
where R_X,Y denotes the total data rate to jointly encode X and Y, and H(X,Y)
denotes the joint entropy of X and Y [97].
Slepian and Wolf discussed the case when the switch is off. Surprisingly, even
though the two sources are encoded separately, the sum of rates RX and RY can still
achieve the joint entropy H(X,Y ) as long as they are jointly decoded, where RX
represents the data rate used to encode the source X and RY represents the data rate
Fig. 2.2. Admissible Rate Region for the Slepian-Wolf Theorem
used to encode the source Y . The Slepian-Wolf Theorem proved the admissible rate
bounds for distributed coding of two sources X and Y [26]:
RX ≥ H(X|Y ) (2.2)
RY ≥ H(Y |X) (2.3)
RX + RY ≥ H(X,Y ) (2.4)
where H(X|Y ) denotes the conditional entropy of X given Y , H(Y |X) denotes the
conditional entropy of Y given X. Fig. 2.2 shows the admissible rate region for the
Slepian-Wolf theorem.
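To make the region concrete, the small sketch below (our own helper, not part of the thesis) tests rate pairs against the bounds (2.2)-(2.4) for a doubly symmetric binary source:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sw_admissible(rx, ry, h_x_given_y, h_y_given_x, h_xy):
    """True if (rx, ry) satisfies the three Slepian-Wolf bounds (2.2)-(2.4)."""
    return rx >= h_x_given_y and ry >= h_y_given_x and rx + ry >= h_xy

# Doubly symmetric binary source: X ~ Bernoulli(1/2), Y = X flipped with
# probability p, so H(X|Y) = H(Y|X) = h2(p) and H(X,Y) = 1 + h2(p).
p = 0.11
hc = h2(p)              # conditional entropies, about 0.5 bit
hxy = 1.0 + hc          # joint entropy
print(sw_admissible(1.0, hc, hc, hc, hxy))   # corner point of the region: True
print(sw_admissible(0.3, 0.3, hc, hc, hxy))  # violates (2.2) and (2.3): False
```

The first call probes the corner point (R_X, R_Y) = (H(X), H(Y|X)) of Fig. 2.2, which is achievable; the second lies below both conditional-entropy bounds.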
2.1.2 Wyner-Ziv Coding Theorem for Lossy Compression
Wyner and Ziv proved the rate-distortion bounds for lossy compression [27]. Al-
though the rate of separate coding may be greater than the rate of joint coding, the
equality is achievable, for example, for a Gaussian source and a mean square error
metric when joint decoding is allowed. Hence, the side information at the encoder is
not always necessary to achieve the rate-distortion bound. As shown in Fig. 2.3, the
source data X and side information Y are both random variables. The decoder has
access to the side information, while the switch determines whether the encoder has
access to it. Wyner and Ziv's theorem proved that

R*(d) ≥ R_X|Y(d)   (2.5)

where d is the measure of the distortion between the source X and the reconstruction
X̂ at the decoder, 0 ≤ d < ∞. R*(d) denotes the rate-distortion function when side
information Y is available only at the decoder. R_X|Y(d) denotes the rate-distortion
function when side information Y is available at both the encoder and the decoder.
Wyner and Ziv presented a specific case where equality is achieved. They showed
that with Gaussian memoryless sources and mean-squared error distortion, no rate
loss is incurred when the encoder has no access to the side information, i.e., X is
Gaussian and Y = X + U, where U is also Gaussian and independent of X. The
distortion is

d(X, X̂) = E[(X − X̂)²]   (2.6)
Under these conditions, the rates required to code the source in both cases are:

R*(d) = R_X|Y(d) = { (1/2) log [ σ_U² σ_X² / ((σ_X² + σ_U²) d) ],   0 < d ≤ σ_U² σ_X² / (σ_X² + σ_U²)
                     0,                                             d ≥ σ_U² σ_X² / (σ_X² + σ_U²)     (2.7)
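Equation (2.7) is straightforward to evaluate; the helper below (our own naming, not thesis code) reproduces both branches:

```python
import math

def wz_rate(d, var_x, var_u):
    """R*(d) = R_X|Y(d) of (2.7) in bits, for Gaussian X and Y = X + U."""
    d_max = var_u * var_x / (var_x + var_u)  # distortion at which the rate is 0
    if d >= d_max:
        return 0.0
    return 0.5 * math.log2(var_u * var_x / ((var_x + var_u) * d))

var_x, var_u = 1.0, 0.25
d_max = var_u * var_x / (var_x + var_u)      # 0.2
print(wz_rate(d_max, var_x, var_u))          # 0.0 at the zero-rate distortion
print(wz_rate(d_max / 2, var_x, var_u))      # 0.5: halving d costs half a bit
```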
2.2 Wyner-Ziv Video Coding Testbed
A Wyner-Ziv video codec generally formulates the video coding problem as an
error correction or noise reduction problem. Hence existing state-of-the-art channel
coding methods are used in the development of Wyner-Ziv codecs. Fig. 2.4 shows
an example of side information in video coding. We use the previously reconstructed
frame as the initial estimate to decode the current frame. The reference frame, as
Fig. 2.3. Wyner-Ziv Coding with Side Information at the Decoder
Fig. 2.4. Example of Side Information: (a) Reference Frame, (b) Current Frame
shown in Fig. 2.4-(a), can be considered as the side information of the current frame
in Fig. 2.4-(b).
Wyner-Ziv video coding using turbo codes and low-density parity-check (LDPC)
codes are the two popular systems in the literature. Both systems show better rate-
distortion performance than conventional INTRA frame coding. Various methods
have been proposed to improve coding performance. Several papers exploit the
relationship between the side information and the original source [98-100]. An
analytical result on the performance of the uniform scalar quantizer is presented
in [101]. A thorough study of the statistics of the feedback channel used to request
parity bits is presented
in [102]. A Flexible Macroblock Order (FMO)-like algorithm [103] is used to partition
the frame such that spatial or temporal side information is generated adaptively to
outperform the case with motion interpolated side information only. Hash codewords
are sent over to the decoder to aid the decoding of the Wyner-Ziv frame and help
to build a low-delay system in [104]. An encoder or decoder based mode decision is
made to embed a block-based INTRA mode [105, 106]. A simple frame subtraction
process is introduced to code the residual instead of the original frame [107]. In the
following we describe our Wyner-Ziv video coding testbed.
2.2.1 Overall Structure of Wyner-Ziv Video Coding
Our Wyner-Ziv video coding (WZVC) testbed is shown in Fig. 2.5. The input
video sequence is divided into two groups that are coded by two different methods.
Part of the frames are coded using an H.264 INTRA frame coder; these are denoted
as key frames. The remaining frames, referred to as Wyner-Ziv frames, are
independently coded using channel coding methods. The INTRA key frames are
used to keep the complexity at the encoder low as well as to generate side information
for the Wyner-Ziv frames at the decoder. Only part of the parity or syndrome bits
from the channel encoder is transmitted to the decoder. Hence, the decoding of these
frames needs side information at the decoder. The side information can be extracted
from the neighboring key frames by extrapolation or interpolation. Increasing the
distance between two neighboring key frames degrades the quality of the side
information of the Wyner-Ziv frames. It is essential to find a good tradeoff between
the number of key frames and the degradation of the side information. Fig. 2.6 shows
an example of a group of pictures (GOP) in Wyner-Ziv video coding. Every other
frame is coded as an intra frame and the remaining frames are coded as Wyner-Ziv
frames. The side information of a Wyner-Ziv frame can be derived from the two
neighboring intra frames.
Two channel coding methods are supported in the system: turbo codes and
low-density parity-check (LDPC) codes. The Wyner-Ziv frames can be coded either
in the pixel domain or in the transform domain with the integer transform used in
H.264. The coefficients are coded bitplane by bitplane. The most significant bitplane
is coded first by the channel encoder, followed by the bitplanes of lower significance.
The entire bitplane is coded as a block with the channel coder. The output of the
channel coder is sent to the decoder. Only part of the parity bits from the encoder
is transmitted through the channel. We assume that the decoder can request more
parity bits until the bitplane is correctly decoded.
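The bitplane ordering described above can be sketched as follows (a simplified illustration of our own; the thesis does not list its implementation):

```python
def extract_bitplanes(coeffs, n_planes):
    """Split non-negative integer coefficients into bitplanes, most
    significant plane first, matching the coding order described above."""
    return [[(c >> b) & 1 for c in coeffs]
            for b in range(n_planes - 1, -1, -1)]

coeffs = [5, 3, 7, 0]                      # binary: 101, 011, 111, 000
planes = extract_bitplanes(coeffs, 3)
print(planes[0])                            # MSB plane: [1, 0, 1, 0]
print(planes[2])                            # LSB plane: [1, 1, 1, 0]
```

Each returned plane would then be channel coded as one block, MSB plane first.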
Fig. 2.5. Wyner-Ziv Video Coding Structure
Fig. 2.6. An Example of GOP in Wyner-Ziv Video Coding
At the decoder, the key frames can be independently decoded by the H.264 INTRA
decoder. To reconstruct the Wyner-Ziv frames, the decoder first derives the side
information from the previously decoded key frames. The side information is an
initial estimate, or noisy version, of the current frame. The incoming channel coder
bits help to reduce the noise and reconstruct the current Wyner-Ziv frame from this
initial estimate. The decoder assumes a statistical model of the "correlation channel"
to exploit the side information: the difference between the original frame and the
estimate is modeled as a Gaussian or Laplacian distribution. If the system operates
in the transform domain, the side information is also integer transformed. The
coefficients, whether in the pixel domain or in the transform domain, are represented
by bitplanes. The channel decoder uses the side information in bitplane representation
and the channel coder bits to decode each symbol. If the decoded symbol is consistent
with the side information, it is used for the reconstruction together with the decoded
symbols from the other bitplanes. Otherwise, the reconstruction process uses the side
information in bitplane representation as the reconstruction to prevent errors. If the
video sequence is coded in the transform domain, the inverse integer transform is
used to recover the sequence after reconstruction.
2.2.2 Channel Codes (Turbo Codes and LDPC Codes)
The testbed supports two channel coding methods: turbo codes and low-density
parity-check (LDPC) codes. The turbo code is built upon the codec from [108] and
the LDPC code upon the codec from [109]. The basic structures of the two channel
coders are described in the following.
Turbo Codes
The structure of the turbo encoder is shown in Fig. 2.7 [38-41]. The input X is sent
to two identical recursive systematic convolutional (RSC) encoders. Before entering
the second RSC encoder, the symbols are randomly interleaved. The two
Fig. 2.7. Structure of Turbo Encoder Used in Wyner-Ziv Video Coding
Table 2.1. Generator matrices of the RSC encoders

states   g1(D)                  g2(D)                  octal form
4        1 + D + D^2            1 + D^2                (7,5)
8        1 + D + D^2 + D^3      1 + D + D^3            (17,15)
16       1 + D + D^4            1 + D^2 + D^3 + D^4    (31,27)
RSC encoders are parallel-concatenated, hence the term Parallel Concatenated
Convolutional Codes (PCCC). In this application, the systematic parts of the output
are discarded and part of the parity bits is sent to the decoder. The puncturer deletes
selected parity bits to reduce the coding overhead.
The structure of the RSC encoders is simple, which guarantees low-complexity
encoding. Generally, the RSC codes used in the turbo encoder have the generator
matrix

G_R(D) = [ 1   g2(D)/g1(D) ]   (2.8)

Table 2.1 summarizes several generator matrices that are frequently used. An example
of the RSC code with 16 states is given in Fig. 2.8.
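To make the recursion concrete, here is a minimal sketch (our own, not the testbed implementation) of the parity branch of the 4-state (7,5) RSC code, with feedback taps g1 = 1 + D + D² and feedforward taps g2 = 1 + D²:

```python
def rsc_encode(bits):
    """Parity output of the 4-state (7,5) RSC code: feedback taps
    g1 = 1 + D + D^2, feedforward taps g2 = 1 + D^2. The systematic
    output equals the input and is discarded, as in Fig. 2.7."""
    s1 = s2 = 0                   # the two delay elements
    parity = []
    for x in bits:
        fb = x ^ s1 ^ s2          # recursive feedback, taps of g1
        p = fb ^ s2               # feedforward combination, taps of g2
        parity.append(p)
        s1, s2 = fb, s1           # shift the register
    return parity

# Impulse response of the parity branch, i.e. the expansion of g2(D)/g1(D).
print(rsc_encode([1, 0, 0, 0, 0]))   # [1, 1, 1, 0, 1]
```

Because of the feedback, a single input 1 produces an infinite-weight parity stream, which is what makes the code recursive.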
The structure of the turbo decoder, shown in Fig. 2.9, is computationally complex
compared with the encoder. X_P^1 and X_P^2 denote the parity bits generated by the
two RSC encoders. The input Y denotes the dependency channel output, which is
Fig. 2.8. Example of a Recursive Systematic Convolutional (RSC) Code
Fig. 2.9. Structure of Turbo Decoder Used in Wyner-Ziv Video Coding
side information available at the decoder in Wyner-Ziv coding. The decoder includes
two soft-input soft-output (SISO) constituent decoders [38–41].
Low Density Parity Check (LDPC) Codes
The low-density parity-check (LDPC) code is an error-correcting code that operates
close to the Shannon limit [42-45]. In the following, we consider only binary LDPC
codes. An LDPC code is a linear block code determined by a generator matrix G or
a parity-check matrix H. Suppose the linear block code is an (N,K) code. G is a
K × N matrix represented as G = [I_K : P] and H is an (N − K) × N matrix
represented as H = [P^T : I_{N−K}]. The encoding of the LDPC code is determined
by c = G^T X, where X is the input vector. All the codewords generated by G satisfy
Hc = 0. This relationship can be represented by the Tanner graph shown in Fig. 2.10,
where a (7,4) linear code is used. H is a 3 × 7 matrix with entries

H = [ 1 1 1 0 1 0 0
      1 1 0 1 0 1 0
      1 0 1 1 0 0 1 ]   (2.9)
Fig. 2.10. Tanner Graph of a (7,4) LDPC Code
An LDPC code has a sparse H, with only a few 1's in each row and column. Regular
LDPC codes contain exactly a fixed number of 1's in every column and every row;
irregular LDPC codes have varying numbers of 1's per row or column.
The LDPC decoder iteratively estimates the distributions in a graph-based model
using the belief propagation algorithm. Given the side information Y, the
log-likelihood ratio is

L(x) = log [ P(x = 0|y) / P(x = 1|y) ]   (2.10)

and the estimate is

x̂ = { 0,  L(x) ≥ 0
      1,  L(x) < 0 }   (2.11)
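The parity-check relation Hc = 0 and the hard decision rule (2.11) can be illustrated directly with the (7,4) matrix of (2.9); the sketch below is in our own notation, not the testbed code:

```python
# Parity-check matrix H of the (7,4) code in (2.9).
H = [[1, 1, 1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

def is_codeword(c):
    """A vector c is a codeword iff every check sums to 0 mod 2 (Hc = 0)."""
    return all(sum(h * b for h, b in zip(row, c)) % 2 == 0 for row in H)

def hard_decision(llrs):
    """Bitwise estimates from log-likelihood ratios, per (2.10)-(2.11)."""
    return [0 if l >= 0 else 1 for l in llrs]

# Message (1,0,1,1) with its three parity bits appended (H = [P^T : I_3]).
print(is_codeword([1, 0, 1, 1, 0, 0, 1]))       # True
print(is_codeword([1, 0, 1, 1, 0, 0, 0]))       # False: one check fails
print(hard_decision([2.1, -0.3, 0.0, -5.0]))    # [0, 1, 0, 1]
```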
2.2.3 Derivation of Side Information
The key frames can be decoded independently by the H.264 INTRA frame decoder.
The previously decoded key frames are used to derive the side information of the
Wyner-Ziv frames by extrapolation or interpolation in the pixel domain. There are
some simple ways to obtain the side information. Suppose frame n is the current
frame and frames (n−1) and (n+1) are the neighboring frames. For example, we can
use the previously reconstructed frame (n−1) as the side information of the current
frame n. Another approach is to take the average of the pixel values from the two
neighboring frames (n−1) and (n+1). In these cases, the quality of the side
information is low and there is no motion estimation at the decoder. To obtain
higher-quality side information, motion estimation can be done at the decoder, which
involves high-complexity processing. Side information can be obtained by
extrapolating the previously reconstructed frame as shown in Fig. 2.11. For every
block in the current frame n, we search for the motion vector MV_{n−1} of the
co-located macroblock in the previous frame (n−1). For natural scenes, the motion
vectors of neighboring frames are closely related, and we can predict the motion
vectors of the current frame from the adjacent previously decoded frames. We use
MV_{n−1} as an estimate of the motion vector MV_n of the current frame. The
patterned reference block in frame (n−1) is derived using MV_n and used as the
side information of the current macroblock in frame n.
Fig. 2.11. Derivation of Side Information by Extrapolation
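The block-copy step of this extrapolation can be sketched in one dimension; this is a simplified illustration of our own, and the names and clamping rule are assumptions, not the thesis code:

```python
def extrapolate_side_info(prev_frame, mv_prev, block=4):
    """1-D sketch of Fig. 2.11: each block of the current frame reuses the
    co-located block's motion vector MV_{n-1} as the estimate of MV_n and
    copies the displaced block from the reconstructed frame (n-1)."""
    n = len(prev_frame)
    side = [0] * n
    for start in range(0, n, block):
        mv = mv_prev[start // block]                # reused motion vector
        src = min(max(start + mv, 0), n - block)    # clamp to the frame
        side[start:start + block] = prev_frame[src:src + block]
    return side

prev = list(range(16))
print(extrapolate_side_info(prev, [0, 4, -4, 0]))
```

A real side estimator works on 2-D blocks with sub-pixel motion, but the data flow is the same: no information from the current frame is used.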
We can also use interpolation to obtain the side information. As shown in Fig.
2.12, motion search is done between the (n− 1)-th key frame s(n− 1) and (n + 1)-th
key frame s(n + 1). For each block in the current frame, as shown in Fig. 2.12,
the side estimator first uses the co-located block in the next reconstructed frame
s(n + 1) as the source and the previous reconstructed frame s(n− 1) as the reference
to perform forward motion estimation. We denote the obtained motion vector as
MVF . We then use the co-located block in the previous frame as the source and
next reconstructed frame as the reference to perform backward motion estimation.
Denote the obtained motion vector as MV_B. The side estimator uses MV_F/2 from
s(n−1) to find the corresponding reference block P_F1, and −MV_F/2 from s(n+1) to
find the corresponding reference block P_F2. We also use MV_B/2 from s(n+1) to find
the corresponding reference block P_B1, and −MV_B/2 from s(n−1) to find the
corresponding reference block P_B2. The reference block is

P = (P_F1 + P_F2 + P_B1 + P_B2) / 4   (2.12)

This average of the four references is the initial estimate of the side information.
Fig. 2.12. Derivation of Side Information by Interpolation
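The averaging in (2.12) is a plain per-pixel mean of the four fetched blocks; a minimal list-based sketch of our own:

```python
def interpolate_side_info(pf1, pf2, pb1, pb2):
    """Average the four reference blocks of (2.12) pixel by pixel."""
    return [[(a + b + c + d) / 4.0 for a, b, c, d in zip(r1, r2, r3, r4)]
            for r1, r2, r3, r4 in zip(pf1, pf2, pb1, pb2)]

# Four constant 2x2 blocks standing in for P_F1, P_F2, P_B1, P_B2.
blocks = [[[v] * 2] * 2 for v in (10, 14, 12, 16)]
p = interpolate_side_info(*blocks)
print(p)   # [[13.0, 13.0], [13.0, 13.0]]
```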
A refined side estimator can be used to extract reference information at the decoder
more effectively. Many current side estimators use only the information extracted
from the previously reconstructed frames. With the input of a Wyner-Ziv frame, the
decoder gradually improves the reconstruction of the current frame, so it is possible
to use the information from the current frame's lower-quality reconstruction as well.
This is analogous to SNR scalability in conventional video coding: a previous frame's
reconstruction is first used as a reference, while lower-quality reconstructions of the
current frame can later be used as references for the enhancement layers. The detailed
implementation of the refined side estimator is shown in Fig. 2.13. After the incoming
parity bits are used along with the reference to produce a lower-quality reconstruction
ŝ_b(n) of the current frame, the refined side estimator performs a second motion
search. In this refined motion search, for every block in ŝ_b(n), the best match in the
previous and the following key frame is obtained, resulting in two new motion vectors
MV_{F,RSE} and MV_{B,RSE}. The two best-matched blocks in the adjacent key
frames are then averaged to construct new side information. The parity bits already
received are then used for second-round decoding with this new side information,
which incorporates the previous side information and can further improve its quality.
Fig. 2.13. Refined Side Estimator
2.2.4 Experimental Results
We compare our implementation with conventional video coding methods. We
test the following coding methods.
• H.263 INTRA: Every frame is coded by the H.263+ reference software TMN3.1.1
in INTRA mode;
• H.264 INTRA: Every frame is coded by H.264 reference software JM8.0 INTRA
mode;
• H.264 IBIB: Every even frame is coded by JM8.0 INTRA mode, while the odd
frames are coded by JM8.0 bi-directional mode with quarter-pixel motion search
accuracy;
• I-WZ: Every even frame is coded by JM8.0 INTRA mode, while the odd frames
are coded as Wyner-Ziv frames. At the decoder, the side information is derived
by interpolation as shown in Fig. 2.12. Motion search is performed with quarter-
pixel accuracy. Then the refined motion search of Fig. 2.13 is performed
to further improve the coding efficiency.
We tested six standard QCIF sequences, Foreman, Coastguard, Carphone, Silent,
Stefan, and Table Tennis, each of which consists of 300 frames. The frame rate is 30
frames per second. The data rate of H.263 INTRA, H.264 INTRA and H.264 IBIB
is adjusted by the quantization parameter (QP). For I-WZ, we adjust the Wyner-Ziv
frame’s data rate by setting the number of bitplanes used for decoding, while the
data rate of the key frame is controlled by QP. The rate distortion performance is
averaged over 300 frames.
In Figs. 2.14-2.19, we show the video coding results. Compared with conventional
INTRA coding, Wyner-Ziv video coding generally outperforms H.264 INTRA
coding by 2-3 dB and H.263+ INTRA coding by 3-4 dB. This shows that by
exploiting source statistics at the decoder, a simple encoder can achieve better coding
results than independent encoding and decoding methods such as INTRA coding.
Compared with H.264 IBIB, Wyner-Ziv video coding still trails by 2-4 dB.
Fig. 2.14. WZVC Testbed: R-D Performance Comparison (Foreman QCIF)
2.3 Rate Distortion Analysis of Motion Side Estimation in Wyner-Ziv
Video Coding
In this section we study the rate-distortion performance of motion side estimation
in Wyner-Ziv video coding (WZVC) [90, 110]. Three terms lead to the performance
loss of Wyner-Ziv coding compared to conventional MCP-based coding. System loss
is due to the fact that side information is unavailable at the encoder in WZVC.
Source coding loss is caused by the inefficiency of channel coding methods and
quantization schemes, which cannot achieve the Shannon limit. We focus on the
third term, video coding loss, in the following analysis. Video coding loss is due to
the fact that the side information is not perfectly generated at the decoder. For MCP-
based video coding, the reference of the current frame is generated from the previously
reconstructed neighboring frames and the current frame. However, WZVC generates
Fig. 2.15. WZVC Testbed: R-D Performance Comparison (Coastguard QCIF)
the reference only from the previous reconstructed neighboring frames without access
to the current frame.
The rate analysis of the residual frame is formulated based on a general power
spectrum model [65-67, 90, 110] and is applied to WZVC below. Assume e(n) =
s(n) − c(n), where e(n) denotes the residual frame, s(n) denotes the original source
frame, and c(n) denotes the reference frame. The power spectrum of the residual
frame is:

Φ_ee(ω) = Φ_ss(ω) − 2Re{Φ_cs(ω)} + Φ_cc(ω)
Φ_cs(ω) = Φ_ss(ω) E{e^(−jω^T Δ)} = Φ_ss(ω) e^(−(1/2) ω^T ω σ_Δ²)
Φ_cc(ω) = Φ_ss(ω)
Φ_ee(ω) = 2Φ_ss(ω) − 2Φ_ss(ω) e^(−(1/2) ω^T ω σ_Δ²)
Φ_ee(ω)/Φ_ss(ω) = 2 − 2e^(−(1/2) ω^T ω σ_Δ²)   (2.13)
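The closed form (2.13) shows how the residual spectrum grows with the motion vector error variance; the small sketch below (our own naming) evaluates the ratio at a given frequency:

```python
import math

def psd_ratio(w1, w2, var_mv_err):
    """Ratio Phi_ee/Phi_ss of (2.13) at frequency omega = (w1, w2) for a
    motion vector error variance var_mv_err."""
    wtw = w1 * w1 + w2 * w2
    return 2.0 - 2.0 * math.exp(-0.5 * wtw * var_mv_err)

print(psd_ratio(1.0, 1.0, 0.0))             # 0.0: perfect motion, no residual
print(round(psd_ratio(3.0, 3.0, 10.0), 3))  # 2.0: reference fully decorrelated
```

A zero error variance gives a zero residual spectrum, while a very large variance drives the ratio toward 2, i.e. the reference behaves like an uncorrelated frame.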
Fig. 2.16. WZVC Testbed: R-D Performance Comparison (Carphone QCIF)
where Δ is the error motion vector and σ_Δ² is the variance of the error motion
vector. The error motion vector is the difference between the derived motion vector
and the true motion vector, where the true motion vector is an ideal motion vector
with minimum distortion. The rate saving over INTRA-frame coding by MCP or other
motion search methods is [111]

ΔR = (1/(8π²)) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ Φ_ee(ω)/Φ_ss(ω) ] dω   (2.14)

Hence the rate difference between two systems using two motion vectors is

ΔR_{1,2} = (1/(8π²)) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^(−(1/2) ω^T ω σ_Δ1²)) / (1 − e^(−(1/2) ω^T ω σ_Δ2²)) ] dω   (2.15)
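The double integral in (2.15) has no closed form, but it is straightforward to evaluate numerically; the following sketch (our own midpoint-rule evaluation, with assumed parameter names) illustrates its behavior:

```python
import math

def rate_difference(var1, var2, n=200):
    """Midpoint-rule evaluation of (2.15): rate difference in bits/sample
    between two systems with motion vector error variances var1 and var2."""
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        wx = -math.pi + (i + 0.5) * step
        for j in range(n):
            wy = -math.pi + (j + 0.5) * step
            wtw = wx * wx + wy * wy
            num = 1.0 - math.exp(-0.5 * wtw * var1)
            den = 1.0 - math.exp(-0.5 * wtw * var2)
            total += math.log2(num / den) * step * step
    return total / (8.0 * math.pi ** 2)

print(round(rate_difference(0.5, 0.5), 6))   # 0.0: equal variances, no gap
print(rate_difference(1.0, 0.25) > 0)        # True: larger variance costs rate
```

The midpoint rule also sidesteps the removable singularity of the integrand at ω = 0.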
Wyner-Ziv video coding is compared with two conventional MCP-based video
coding methods, i.e., DPCM-frame video coding and INTER-frame video coding.
DPCM-frame coding subtracts the previous reconstructed frame from the current
frame and codes the difference. INTER-frame coding performs motion search at the
Fig. 2.17. WZVC Testbed: R-D Performance Comparison (Silent QCIF)
encoder and codes the residual frame. The rate differences between the three coding
methods, obtained using (2.15), are

ΔR_{DPCM,WZ} = (1/(8π²)) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^(−(1/2) ω^T ω σ_MV²)) / (1 − e^(−(1/2) ω^T ω (1−ρ²) σ_MV²)) ] dω   (2.16)

ΔR_{DPCM,INTER} = (1/(8π²)) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^(−(1/2) ω^T ω σ_MV²)) / (1 − e^(−(1/2) ω^T ω (1−ρ²) σ_Δβ²)) ] dω   (2.17)

where σ_MV² denotes the variance of the motion vector, ρ denotes the correlation
between the true motion vector and the motion vector obtained by the side estimator,
and σ_Δβ² denotes the variance of the motion vector error. The rate savings over DPCM-
frame video coding by using Wyner-Ziv video coding is more significant when the
motion vector variance σ_MV² is small. This makes sense, since for lower motion
vector variance the side estimator has a better chance of estimating a motion vector
close to the true motion vector. Wyner-Ziv coding can achieve a gain of up to 6 dB
(for small motion vector variance) or 1-2 dB (for normal to large motion vector variance)
Fig. 2.18. WZVC Testbed: R-D Performance Comparison (Stefan QCIF)
over DPCM-frame video coding. INTER-frame coding generally outperforms
Wyner-Ziv video coding by around 6 dB. For sequences with small σ_MV², the gap is
smaller, ranging from 1-4 dB depending on the specific side estimator used.
We further study side estimators using two motion search methods, sub-pixel
motion search and multi-reference motion search. In conventional MCP-based video
coding, the accuracy of the motion search has a great influence on the coding efficiency.
However, Wyner-Ziv video coding is not as sensitive to the accuracy of the motion
search. For small σ_MV², motion search with integer-pixel accuracy falls behind
quarter-pixel accuracy by less than 0.4 dB. The difference for larger σ_MV² is even
smaller. In this case, using 2:1 subsampling does not affect the coding efficiency
significantly. Fig. 2.20 shows an example for the Foreman QCIF sequence. Half-pixel
search accuracy improves by around 0.2-0.3 dB over integer motion search accuracy.
Quarter-pixel search accuracy fails to show a noticeable improvement over half-pixel
Fig. 2.19. WZVC Testbed: R-D Performance Comparison (Table Tennis QCIF)
search accuracy. A 2:1 subsampled motion search incurs only a 0.1 dB coding loss
compared to integer motion search accuracy. The experimental results are consistent
with our analytical result. When the decoder complexity becomes an issue, the 2:1
subsampled side estimator can be an acceptable alternative. The results for the other
sequences show similar patterns.
The rate difference between N references and one reference is

ΔR_{N,1} = (1/(8π²)) ∫_{−π}^{π} ∫_{−π}^{π} log₂ IMR(ω, N) dω   (2.18)

and

IMR(ω, N) = [ (N+1)/N − 2e^(−(1/2) ω^T ω σ_Δa²) + ((N−1)/N) e^(−(1−ρ_Δ) ω^T ω σ_Δa²) ] / [ 2 − 2e^(−(1/2) ω^T ω σ_Δa²) ]   (2.19)
where ρ_Δ is the correlation between two motion vector errors; we consider the case
ρ_Δ = 0. σ_Δa² denotes the actual variance of the motion vector error, which is due to
the motion search pixel inaccuracy and the imperfect correlation between current and
previous motion vectors. The analysis of the rate difference using N
Fig. 2.20. Wyner-Ziv Video Coding with Different Motion Search Accuracies (Foreman QCIF)
references over one reference shows that multi-reference motion search can effectively
improve the rate-distortion performance of Wyner-Ziv video coding. Fig. 2.21 shows
the result for the Foreman QCIF sequence. Using five references can improve the
coding efficiency by 0.5-1 dB over using one reference, while using ten references
shows no further noticeable improvement over using five. A similar observation can
be made for the other sequences.
The experimental results confirm the above theoretical analysis. Current Wyner-
Ziv video coding schemes still fall far behind the state-of-the-art video codecs. A
better motion estimator at the decoder is essential to improve the performance.
Fig. 2.21. Wyner-Ziv Video Coding with Multi-reference Motion Search (Foreman QCIF)
2.4 Wyner-Ziv Video Coding with Universal Prediction
Wyner-Ziv video coding using the channel coding methods described above is
generally a reverse-complexity system, where the decoder carries a high complexity
burden. In some scenarios, low complexity at both the encoder and the decoder is
desirable; wireless handheld cameras and phones are one such case. To solve the
problem, a transcoder can be used as an intermediate part of the system, but the use
of a transcoder increases the transmission cost and the delay. The goal of this section
is to design a Wyner-Ziv video coding approach with a low-complexity encoder and
decoder. To this end, we introduce the idea of universal prediction [91, 92]. The
definition of universal prediction is stated in [112]:
Roughly speaking, a universal predictor is one that does not depend on the unknown
underlying model and yet performs essentially as well as if the model were known in
advance.
The prediction problem in general is to predict xt based on the previous data
x^(t−1) = (x1, x2, · · · , xt−1). An associated loss function λ(·) measures the distance
between xt and its predicted version zt. If the statistical model of the data is well
understood, classical prediction theory can be used. For natural video sequences,
however, the statistical model is unknown, and a universal predictor can be used to
predict the future data from the previous data. Merhav and Ziv have shown in [113]
that in certain cases the Wyner-Ziv rate-distortion bound can be achieved by universal
compression without binning.
We use the universal predictor as the side estimator in Wyner-Ziv video coding
[91,92]. Replacing the block-based motion estimation side estimator with a universal
side estimator reduces the decoder complexity dramatically. Each video frame is
formulated as a vector and the pixel values at the same spatial position are grouped
as I(k, l), where (k, l) is the spatial coordinate. Denote one sequence of I(k, l) as X =
x1, x2, · · · , xt, where t is the temporal index of the sequence, and denote its estimate
as Z = z1, z2, · · · , zt. The transition matrix from X to Z is denoted Π = {Π(i, j)},
i, j ∈ [0, 255], where Π(i, j) is the probability that the estimate is j when the input
is i. The loss function is denoted Λ(i, j), where i is the input in X and j is the
corresponding estimate in Z. Denote the conditional probability on the context as
P(zt = α | z1, z2, · · · , zt−1). The optimal estimate zt is the one minimizing the expected
loss. In the extreme case when

Λ(i, j) = (i − j)²    (2.20)

and

Π(xt, zt) = 1 if xt = zt, and 0 if xt ≠ zt    (2.21)

the optimal estimate is a weighted average of the previous occurrences with the same
context.
As shown in Fig. 2.22, for each pixel in frame t, we use N previous frames
as contexts. In the setup of the experiment, we set N = 4 and the context is
(zt−4, zt−3, zt−2, zt−1). At the decoder, the universal prediction side estimator searches
for occurrences of the context in the previously decoded frames, and the side estimate
of the current pixel is the average of the values that followed those occurrences. For
example, suppose the context for the current pixel is (210, 211, 210, 211) and that this
context occurred three times in the previous frames, followed by the pixel values 210,
211, and 212 respectively. The estimate of the current pixel is then
(210 + 211 + 212)/3 = 211.
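The matching-and-averaging rule above can be sketched as follows. This is an illustrative implementation, not the thesis code: the function name `universal_side_estimate`, the dictionary-based context lookup, and the fallback for unseen contexts are my own choices, assuming grayscale frames stored as 2-D arrays.

```python
import numpy as np

def universal_side_estimate(frames, t, n_context=4):
    """Estimate frame t by context matching: for each pixel, the temporal
    context is the tuple of values at the same spatial position in the
    n_context previous frames; all past positions whose context matches
    are averaged to form the estimate."""
    h, w = frames[0].shape
    # Context of every pixel position of the frame being estimated.
    target_ctx = np.stack([frames[t - k] for k in range(n_context, 0, -1)],
                          axis=-1)
    # Collect past occurrences: context seen before time s -> value at s.
    occurrences = {}
    for s in range(n_context, t):
        ctx = np.stack([frames[s - k] for k in range(n_context, 0, -1)],
                       axis=-1)
        for i in range(h):
            for j in range(w):
                occurrences.setdefault(tuple(ctx[i, j]), []).append(
                    frames[s][i, j])

    estimate = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            vals = occurrences.get(tuple(target_ctx[i, j]))
            # Fall back to the previous frame's pixel if the context is unseen.
            estimate[i, j] = np.mean(vals) if vals else frames[t - 1][i, j]
    return estimate
```

With the example above, the three occurrences of the context (210, 211, 210, 211) would contribute the values 210, 211, and 212, whose mean 211 becomes the estimate.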
[Figure: context pixels z(t−4), z(t−3), z(t−2), z(t−1) taken from Frames (t−4) through (t−1) to predict z(t) in Frame (t)]
Fig. 2.22. Universal Prediction Side Estimator Context
Fig. 2.23 shows the results for six sequences. We compare the universal prediction
side estimator with the block-based motion estimation side estimator, and also include
the H.264 integer motion search reference frame for comparison. For all six sequences,
the H.264 reference performs best among the three approaches. The motion estimation
side estimator generally produces better quality than the universal prediction side
estimator, except for the Mobile sequence. For the Coastguard and Foreman sequences,
the motion estimation side estimator performs significantly better, since these two
sequences closely match the linear motion model assumed by the motion estimation
side estimator. In the other four sequences, the universal prediction side estimator
performs comparably to the motion estimation based side estimator. Considering its
low complexity, the universal prediction side estimator shows great potential. The
coding efficiency of Wyner-Ziv video coding using the universal prediction side
estimator could be further improved by refining the correlation model between the
original frame and the side information.
[Figure: six panels of Side Information PSNR (dB) vs. Reference Data Rate (kbps) for the Carphone, Coastguard, Foreman, Mobile, Mother and Daughter, and Salesman sequences; curves: side estimator with universal prediction, side estimator with motion estimation, H.26x integer pixel motion search]
Fig. 2.23. Side Estimator by Universal Prediction
3. BACKWARD CHANNEL AWARE
WYNER-ZIV VIDEO CODING
3.1 Introduction
Conventional motion-compensated prediction (MCP) based video compression
performs motion estimation at the encoder. A typical encoder requires more compu-
tational resources than the decoder. The latest video coding standard, H.264, adopted
many new coding tools to improve video compression performance, which leads to
further increases in encoder complexity. While this approach meets the requirements of most
applications, it poses a challenge for some applications such as video surveillance,
where the encoder has limited power and memory, while the decoder has access to
more powerful computational resources. In these applications a simple encoder is
preferred with computationally intensive parts left to the decoder.
A Wyner-Ziv video codec generally formulates the video coding problem as an
error correction or noise reduction problem. A Wyner-Ziv encoder usually encodes
a frame independently using a channel coding method and sends the parity bits to
the decoder. Frames encoded this way are referred to as Wyner-Ziv frames. Prior
to decoding a Wyner-Ziv frame, the decoder first analyzes the video statistics based
on its knowledge of the previously decoded frames and derives the side information
for the current frame. This side information serves as the initial estimate, or noisy
version, of the current frame. With the parity bits from the encoder, the decoder can
gradually reduce the noise in the estimate. Hence the quality of the initial estimate
plays an important role in the decoding process. A simple and widely used way to
derive the side information is to either extrapolate or interpolate the information from
the previously decoded frames as described in Section 2.2.3. The advantage of frame
extrapolation is that the frames can be decoded in sequential order, and hence every
frame (except the first few frames) can be coded as Wyner-Ziv frames. However, the
quality of the side estimation from the extrapolation process may be unsatisfactory.
This has led to research on more sophisticated extrapolation techniques to improve
the side estimation. Many Wyner-Ziv coding methods also resort to the use of frame
interpolation, which generally produces higher quality side estimates. The problem
with interpolation is that it requires some frames, after the current frame in the
sequence order, be decoded before the current frame. This means that at least some
of these frames, referred to as key frames, cannot be coded as a Wyner-Ziv frame.
Instead, they should be coded by conventional methods. Since we need to keep the
encoder computationally simple, these frames are often INTRA coded, which costs
more data rate than the predictive coding methods. One way to alleviate this problem
is to increase the distance between two key frames. However, with the increase in
distance, the side estimation quality quickly degrades. The results show that larger
key frame distances can only marginally improve the overall coding efficiency, and
sometimes even leads to worse coding performance. It is for this reason that many
Wyner-Ziv methods code every other frame as Wyner-Ziv frames and key frames
alternately.
One concern for Wyner-Ziv video coding is its coding performance compared to
state-of-the-art video coding, such as H.264. In conventional video coding, most
frames are coded as predictive frames (P Frame) or bi-directionally predictive frames
(B Frame) and only very few are coded as INTRA frames due to the large amount
of temporal redundancy in a video sequence. INTRA frames consume many more
bits than P frames and B frames to achieve identical quality, since they do not take
advantage of the temporal correlation across frames. In many Wyner-Ziv video coding
schemes, many frames are INTRA coded to guarantee enough side information at the
decoder. This inevitably leads to a compromise between coding efficiency and encoding
complexity.
In this chapter, we address this problem using Wyner-Ziv video coding with Backward
Channel Aware Motion Estimation to improve the coding efficiency while maintaining
low encoder complexity. The rest of this chapter is organized
as follows. Section 3.2 gives an overview of Backward Channel Aware Motion Esti-
mation. Section 3.3 discusses the system of Wyner-Ziv video coding with Backward
Channel Aware Motion Estimation. Error resilience in the backward channel is discussed
in Section 3.4. Simulation details and performance evaluations are given in Section
3.5.
3.2 Backward Channel Aware Motion Estimation
Motion-compensated prediction (MCP) based video coding can efficiently remove
the temporal correlation of a video sequence and achieve high compression efficiency.
However, motion estimation is highly complex, making it unsuitable for power
constrained video terminals such as wireless devices. Wyner-Ziv video coding with
Backward Channel Aware Motion Estimation is based on network-driven motion
estimation (NDME), proposed by Rabiner and Chandrakasan in [114]. Network-driven
motion estimation was first proposed for wireless video terminals. In NDME the
motion estimation task is moved to the decoder.
Fig. 3.1 shows the basic diagram of network-driven motion estimation, which
is a combination of motion prediction (MP) and conditional replenishment (CR).
Conditional replenishment is a standard low complexity video coding method without
motion estimation. It codes the difference between the current frame and the previous
frame. We can consider it as the special case of INTER coding with zero motion
vector. CR is efficient for reducing the temporal correlation for slow-motion video
sequences. For video sequences with high motion, conditional replenishment may not
be able to provide high compression efficiency. Motion prediction makes use of the
assumption of constant motion in the video sequence. The computationally intensive
operation of motion estimation is moved to the high power decoder, which may be
at the base station or in the wired network [114]. To estimate the motion vector of
frame n at the decoder, motion estimation is performed on the reconstructed frames
[Figure: block diagram; motion estimation and motion compensation operate on Frames (n−2) and (n−1) to produce the PMV and the MP difference; the variances of the MP and CR difference signals are compared against the bias Pref_CR]
Fig. 3.1. Adaptive Coding for Network-Driven Motion Estimation (NDME)
(n − 1) and (n − 2) to find the motion vector of frame (n − 1). Then the predicted
motion vector (PMV) of frame n is estimated by
PMV (n) = MV (n − 1) (3.1)
The PMV shows a high correlation with the true motion vector derived at the encoder.
The NDME scheme uses adaptive coding to choose between motion prediction and
conditional replenishment. The choice is made at the decoder, so the adaptive scheme
improves the coding efficiency without adding computational complexity at the
encoder. Since CR does not require the predicted motion vectors to be sent back to
the encoder, it is the more favorable choice when there is little motion. Denote the
variances of the MP and CR residuals as a²_MP and a²_CR respectively. MP mode is
chosen when

a²_MP < a²_CR − Pref_CR    (3.2)

where Pref_CR is a constant bias parameter that favors the selection of CR.
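The decision rule in equation (3.2) can be sketched as follows. This is an illustrative sketch, not the NDME implementation: the function name `choose_mode` and the value of the bias `PREF_CR` are my own assumptions, since the text gives no numeric value.

```python
import numpy as np

# Hypothetical bias toward CR; the text calls this Pref_CR but gives no value.
PREF_CR = 2.0

def choose_mode(curr, prev, mp_prediction, pref_cr=PREF_CR):
    """Adaptive mode decision of Eq. (3.2): pick motion prediction (MP)
    only when its residual variance beats conditional replenishment (CR)
    by more than the bias pref_cr, since CR needs no motion vectors on
    the backward channel."""
    var_cr = np.var(curr - prev)            # CR residual: plain frame difference
    var_mp = np.var(curr - mp_prediction)   # MP residual: motion-predicted
    return "MP" if var_mp < var_cr - pref_cr else "CR"
```

For a static block the frame difference is near zero, so CR wins; MP wins only when motion prediction reduces the residual variance by more than the bias.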
Fig. 3.2 shows the flow chart of the network-driven motion estimation algorithm. The
first two frames are INTRA coded. The motion vector of the n-th frame is derived at
the decoder from the previously reconstructed frames. Then the decoder adaptively chooses
[Figure: flow chart; the decoder performs motion estimation, sets PMV(n) = MV(n−1), and sends a PMV or CR signal to the encoder, which either INTRA codes the frame or refines the MV and codes the MCP difference]
Fig. 3.2. Network-Driven Motion Estimation (NDME)
motion prediction or conditional replenishment. A signal indicating the choice is sent
back to the encoder along with the predicted motion vector if MP mode is chosen.
In either mode, the encoder refines the motion vector by searching the ±1 pixel
positions around the received motion vector to find a more accurate match. If a scene
change is detected or the encoder and decoder lose synchronization, an INTRA frame
is inserted to refresh the sequence. Experimental results show that NDME can reduce
the encoder complexity significantly with a slight increase in the data rate compared
with encoder-based motion estimation.
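The ±1 refinement step can be sketched as a nine-candidate SAD search. The function name, argument layout, and fallback behavior are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def refine_mv(block_pos, curr_frame, ref_frame, mv, block=16):
    """Encoder-side refinement sketch: evaluate the +/-1 pixel positions
    around the motion vector received on the backward channel and keep
    the candidate with the lowest SAD.  The 3x3 window costs only nine
    SAD computations, so the encoder stays lightweight."""
    y, x = block_pos
    src = curr_frame[y:y + block, x:x + block]
    best, best_mv = None, mv
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ry, rx = y + mv[0] + dy, x + mv[1] + dx
            if (ry < 0 or rx < 0 or ry + block > ref_frame.shape[0]
                    or rx + block > ref_frame.shape[1]):
                continue  # candidate falls outside the reference frame
            sad = np.abs(src - ref_frame[ry:ry + block, rx:rx + block]).sum()
            if best is None or sad < best:
                best, best_mv = sad, (mv[0] + dy, mv[1] + dx)
    return best_mv
```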
3.3 Backward Channel Aware Wyner-Ziv Video Coding
In this section we extend Wyner-Ziv video coding by coding the key frames with
the use of Backward Channel Aware Motion Estimation (BCAME). We refer
to our Wyner-Ziv video coding method based on BCAME as BCAWZ. The basic
idea of BCAME is to perform motion estimation at the decoder and send the motion
information back to the encoder through a backward channel, which is similar to the
idea of NDME in Section 3.2. Hence we are able to improve the coding efficiency of
the key frames and the side estimation quality without significantly increasing the
encoder complexity.
For natural video sequences, the motion of objects is continuous and adjacent
video frames are closely correlated. Therefore, it is possible to predict the motion
of the current frame using the information of its adjacent frames. In conventional
MCP-based video coding, the current frame is accessible to the encoder and the
motion vector is estimated by comparing the current frame with the reference frame.
In BCAME, the motion search is performed at the decoder without accessing the
current frame. The motion vectors are sent back to the encoder through a feedback
channel. For a sequence we encode the first and third frames as INTRA frames.
All the other odd frames are coded with BCAME, and we refer to these backward
predictively coded frames as BP frames. All the even frames are coded as Wyner-Ziv
frames. A BP frame is coded as follows. Assume the two BP frames prior to the
current BP frame, as shown in Fig. 3.3, have been decoded at the decoder. For each
block in the current BP frame, we use its co-located block in one of the previous two
BP frames as the source and the other BP frame as the reference. Block based motion
search is done at the decoder to estimate the motion vector. The motion vectors are
sent back to a motion vector buffer at the encoder through the backward channel
that is usually available in most Wyner-Ziv methods. This buffer is updated when
the next frame’s motion vectors are received. At the encoder, we use the received
motion vectors with the previous reconstructed BP frames to generate the motion
compensated reference for the current BP frame. The residue between the current
BP frame and its motion compensated reference is then transformed and entropy
coded. Depending on which of the previous decoded BP frames is used as the source
(or the reference) at the decoder, we can at least obtain two sets of motion vectors
as shown in Fig. 3.3 and 3.4.
3.3.1 Mode Choices in BCAME
Mode I: Forward Motion Vector
Mode I is shown in Fig. 3.3. Frame A and B are the previous two reconstructed
BP frames stored in the frame buffer at the decoder. The temporal distances between
adjacent frames are denoted as TDAB and TDBC . To find the motion vector in the
current macroblock, we use the motion information of the colocated macroblock in
the previous frame by assuming that a constant translational motion velocity remains
across the frames. For each block in the current frame, we use the co-located block
in B and search for its best match in A and obtain the forward motion vector MVF .
Assuming linear motion field, the motion vector for the block in the current BP frame
is then
MVI = (TDBC / TDAB) · MVF

Since we code BP frames and Wyner-Ziv frames alternately, TDAB = TDBC = 2, so
MVI = MVF.
Mode II: Backward Motion Vector
The second mode is obtained in a similar way as Mode I but we use the co-located
frame in A as the source, as shown in Fig. 3.4. We search for the best matched block
in B. This motion vector is referred to as the backward motion vector MVB. Again
assuming linear motion, the motion vector for the current frame is
MVII = −(TDAC / TDAB) · MVB

Here TDAB = 2 and TDAC = 4, so MVII = −2 MVB.
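The two scaling rules can be sketched as follows; the helper names are hypothetical, and integer motion vector components are assumed so the scaled vectors stay integral with the temporal distances used here.

```python
def mode_i_mv(mv_f, td_ab=2, td_bc=2):
    """Mode I: scale the forward motion vector MVF by the temporal
    distance ratio TDBC/TDAB (identity when BP and Wyner-Ziv frames
    alternate, since both distances are 2)."""
    return tuple(v * td_bc // td_ab for v in mv_f)

def mode_ii_mv(mv_b, td_ab=2, td_ac=4):
    """Mode II: the backward motion vector MVB points against the motion
    direction, so it is negated and scaled by TDAC/TDAB (a factor of -2
    for the alternating frame structure)."""
    return tuple(-v * td_ac // td_ab for v in mv_b)
```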
Mode III: Encoder-Based Mode Selection
These two sets of motion vectors are sent back to the encoder, where the original
current BP frame is available. The encoder can then do a mode selection to choose
[Figure: Reference A, Reference B, and the current frame; the forward motion vector MVF from the co-located MB in B to its match in A is scaled over the temporal distances TDAB and TDBC to give MVI]
Fig. 3.3. Mode I: Forward Motion Vector for BCAME

[Figure: Reference A, Reference B, and the current frame; the backward motion vector MVB from the co-located MB in A to its match in B is negated and scaled over the temporal distances TDAB and TDAC to give MVII]
Fig. 3.4. Mode II: Backward Motion Vector for BCAME
the best matched motion vector based on metrics such as mean squared error (MSE)
or sum of absolute difference (SAD).
Optimal Mode = arg min over k ∈ {I, II} of ( Σ(i,j) D[x(i, j) − x(k)(i, j)] ) / (N × N),    (3.3)
where k denotes the index of the modes, x(i, j) denotes the original pixel value in
the position (i, j), x(k)(i, j) denotes the reconstructed pixel value using mode k, N
represents the size of the macroblock, and the summation is over all the pixels of
the current macroblock. According to this measurement of fidelity, we obtain the
optimal mode with highest peak signal-to-noise ratio (PSNR). The mode decision
result is sent to the decoder along with the transform coefficients of the current BP
frame. We refer to this mode as Mode III. Although this mode selection scheme uses
equation (3.3) to make the decision and thus places more computational load on the
encoder compared to using Mode I or Mode II only, the experimental results shown
in Fig. 3.6 and 3.7 show that it is worth the additional work.
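Equation (3.3) reduces to picking the candidate reconstruction with the smaller average distortion. A minimal sketch using MSE follows; the function name is assumed, and SAD would work the same way.

```python
import numpy as np

def select_mode(orig_mb, recon_mode_i, recon_mode_ii):
    """Mode III sketch (Eq. 3.3): with the original macroblock available
    at the encoder, pick whichever candidate reconstruction has the
    smaller mean distortion (equivalently, the higher PSNR)."""
    mse = {k: np.mean((orig_mb - r) ** 2)
           for k, r in (("I", recon_mode_i), ("II", recon_mode_ii))}
    return min(mse, key=mse.get)
```

Since minimizing MSE over the macroblock maximizes its PSNR, this realizes the "optimal mode with highest PSNR" criterion directly.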
Compared to other Wyner-Ziv video coding methods, BCAWZ provides an ef-
ficient approach to predictively code the key frames without greatly increasing the
encoder complexity. We note that since these two motion vectors are also needed
to generate the interpolated side estimate at the decoder for the Wyner-Ziv frame
between A and B, the increase in the decoder's complexity is marginal.
3.3.2 Wyner-Ziv Video Coding with BCAME
The key idea behind BCAWZ is to INTER-code the key frames that would otherwise be
INTRA-coded, without significantly increasing the encoder's computational complexity. A
Wyner-Ziv video coding scheme with BCAME is shown in Fig. 3.5. For a BP frame,
after the motion vector is received from the decoder, the motion-compensated refer-
ence is generated and the residual frame is obtained. This residual is transformed and
entropy coded as in H.264. Wyner-Ziv frames are also coded in the transform domain.
Every Wyner-Ziv frame is coded with the integer transform proposed in H.264 [1].
Each coefficient is then represented by 11 magnitude bits and a sign bit. The bits
[Figure: system block diagram; Wyner-Ziv frames pass through transform, bitplane extraction, and a channel encoder, with parity bits sent over the forward channel on request; BP frames are motion compensated at the encoder using motion vectors from the motion vector buffer, and the residual frame is transformed, quantized, and entropy coded; the decoder performs motion estimation on reconstructed frames, sends motion vectors over the backward channel, interpolates the side information, channel decodes, and reconstructs the frames]
Fig. 3.5. Backward Channel Aware Wyner-Ziv Video Coding
at the same bitplanes are coded with a low-density parity-check (LDPC) code. The
parity bits for each bitplane are sent to the decoder in the descending order of bit
significance. At the decoder, a side estimate is generated by interpolating the adja-
cent key frames with the motion search. The side estimate is then transform coded
and represented by bitplanes. Parity bits generated from the corresponding bitplanes
at the encoder are used to correct the errors in each bitplane, also in the descending
order of bit significance, until the total rate budget for this frame is achieved. We
should note that the backward channel, which carries the motion vectors from the
decoder to the encoder, is connectionless: the encoder does not provide any feedback
to the decoder upon receiving the motion vectors.
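The 11-magnitude-bit-plus-sign representation of the Wyner-Ziv coefficients can be sketched as follows. This illustrates only the bitplane extraction and reassembly steps, not the LDPC coding; the function names are my own.

```python
import numpy as np

def extract_bitplanes(coeffs, n_bits=11):
    """Split transform coefficients into a sign plane plus n_bits
    magnitude bitplanes, most significant plane first, matching the
    11 magnitude bits + 1 sign bit representation."""
    c = np.asarray(coeffs)
    sign = (c < 0).astype(np.uint8)
    mag = np.abs(c).astype(np.int64)
    planes = [((mag >> b) & 1).astype(np.uint8)
              for b in range(n_bits - 1, -1, -1)]  # MSB plane first
    return sign, planes

def assemble(sign, planes):
    """Inverse: rebuild coefficients from the sign plane and bitplanes."""
    mag = np.zeros_like(planes[0], dtype=np.int64)
    for p in planes:
        mag = (mag << 1) | p
    return np.where(sign == 1, -mag, mag)
```

Ordering the planes most significant first mirrors the transmission order: parity bits for the high-order planes are sent and decoded before the low-order ones.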
3.4 Error Resilience in the Backward Channel
When video data is transmitted over a network, error-free delivery of the data
packets is typically unrealistic due to either traffic congestion or impairments of
the physical channel. The errors can also lead to de-synchronization of the encoder
and the decoder. Error resilient video coding techniques [115] have been developed
to mitigate transmission errors. Conventional motion-compensated prediction (MCP)
based video coders, such as H.264 and MPEG-4, include several error resilience meth-
ods [14, 116–118]. Methods to address the problem of forward channel errors have
been extensively studied. We now consider the scenario in which the backward channel
is subject only to erasure errors or delays. The case of a purely noisy backward
channel is not considered here. We also assume the backward channel is one way and
connectionless. Since the motion vectors sent back to the encoder play a crucial role
in predictive coding, it is important to make sure the motion vectors are resilient to
transmission errors and delays. In an error free coding scenario, the decoder sends
the motion vectors of the i-th frame, denoted as MVi. The encoder receives MVi
and generates the residual frame RFi and sends the bitstream through the forward
channel. At the decoder, the decoder reconstructs the frame by the received RFi and
the stored motion vectors MVi. This changes when there is an erasure error or delay in
the backward channel. In this case, the motion vector buffer is not updated and the
encoder continues to use the motion vectors of the (i−2)-th frame, generating a
residual frame denoted RF′i with MVi−2. The decoder, however, reconstructs the frame
with the residual data RF′i and the motion vectors MVi. The reconstructed frames at
the encoder and the decoder thus lose synchronization, which causes drift that can
propagate through the rest of the sequence.
To address this problem, we propose a two-stage adaptive coding procedure. The
first stage is a simple resynchronization check: a synchronization marker, an entry
denoting the index of the frame, is inserted before the motion information in the
bit stream. The bandwidth needed for this extra field is negligible. With this index,
the encoder can quickly detect an error when the received index does not match its
own index. In the second stage, the encoder codes the key frames adaptively based on
the decision of the synchronization detector. When desynchronization is detected,
the encoder ignores the motion information and codes the key frame as an INTRA
frame. This frame
type selection decision is sent to the decoder and the decoder decodes this frame as
an INTRA frame. For the next key frames, the decoder continues to send the motion
vectors back. After synchronization is reestablished, the encoder resumes coding the
key frames as BP frames.
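The two-stage procedure can be sketched as a per-key-frame decision at the encoder; the function name, return values, and buffer representation are illustrative assumptions.

```python
def encode_key_frame(frame_idx, received_idx, received_mvs, mv_buffer):
    """Backward-channel error handling sketch: (1) compare the frame
    index carried with the motion vectors against the encoder's own
    index; (2) on mismatch (loss or delay), ignore the vectors and fall
    back to INTRA coding, signalling the frame type to the decoder."""
    if received_idx == frame_idx:
        mv_buffer[:] = received_mvs   # in sync: update buffer, BP-code
        return "BP"
    return "INTRA"                    # desync detected: refresh frame
```

Once the decoder receives the INTRA frame, both sides hold identical reconstructions again and BP coding resumes, which is what stops the drift from propagating.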
3.5 Experimental Results
We implemented our scheme based on the H.264 reference software JM8.0 [119].
The Foreman, Coastguard, Carphone, and Mobile QCIF sequences are used for our
experiments. Each sequence consists of 300 frames at 30 frames/second. The
rate-distortion (R-D) results include both the key frames and Wyner-Ziv frames. The
objective visual quality is measured by the peak signal to noise ratio (PSNR) of the
luminance component.
Figs. 3.6-3.9 show the R-D performance for the four sequences. We first compare
BCAWZ with Mode I against Wyner-Ziv video coding with INTRA coded key frames.
Using BCAWZ improves the performance by 1-2 dB. The improvement is less significant
at high data rates than at low data rates, because at low data rates the video
quality is more dependent on
the reference quality. With the mode selection in Mode III, the performance can be
further improved by 0.5-2 dB. When BCAWZ is compared with conventional video
coding, we find that BCAWZ can achieve 4-5 dB gain over H.264 INTRA coding.
However, compared with the state-of-the-art predictive coding, BCAWZ still trails
H.264 by as much as 4-6 dB, where the H.264 results are coded with the I-B-P-B-P
frame structure with quarter pixel motion search and only the first frame is INTRA
coded. Generally, BCAWZ performs better for slow motion sequences such as Carphone,
where the motion of neighboring frames is continuous and the correlation between
neighboring motion vectors is higher. Fig. 3.10 shows an example comparing BCAWZ
(Mode III) and WZ with INTRA key frames. The test sequence is the Foreman CIF
sequence at 30 frames/second, and both are coded at 511 kbits/second. The frame
shown is the 20th frame, which is a key frame in both scenarios. Fig. 3.10-(a) is
an INTRA coded key frame with PSNR 28.6552 dB; Fig. 3.10-(b) is a BP frame with
PSNR 34.6404 dB. The average PSNR
difference for the entire sequence is 3.5 dB.
Video sequences contain both spatial and temporal redundancy. INTRA coding reduces
only the spatial redundancy. Backward channel aware motion estimation removes the
temporal correlation as INTER coding does, achieving better coding efficiency than
INTRA coding. Using BCAME, BCAWZ achieves a 1-3 dB gain over Wyner-Ziv video coding
schemes with INTRA key frames [29] [106]. However, BCAWZ exhibits some discontinuous
motion artifacts, such as small blockiness.
[Figure: PSNR (dB) vs. Data Rate (kbps); curves: H.264 I-B-P-B-P, BCAWZ (Mode III), BCAWZ (Mode I), WZ-INTRA, H.264 INTRA]
Fig. 3.6. BCAWZ: R-D Performance Comparison (Foreman QCIF)
Fig. 3.11 shows the backward channel bandwidth as a percentage of the forward
channel bandwidth.

[Figure: PSNR (dB) vs. Data Rate (kbps); curves: H.264 I-B-P-B-P, BCAWZ (Mode III), BCAWZ (Mode I), WZ-INTRA, H.264 INTRA]
Fig. 3.7. BCAWZ: R-D Performance Comparison (Coastguard QCIF)

For both sequences with Mode I, the backward channel bandwidth is 5-8% of the
forward channel at lower data rates. This percentage reduces
below 3% at mid to high data rates. To use the backward motion vectors, Mode III
needs roughly twice as much backward channel bandwidth as Mode I. Such backward
channel usage can be readily satisfied in many communication systems.
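A rough back-of-envelope check is consistent with these measurements. Every number below except the QCIF frame dimensions is an illustrative assumption (per-vector bit cost, BP frame rate, and the forward rate used for comparison are not specified in the text).

```python
# Back-of-envelope estimate of the backward channel load (illustrative).
# A QCIF frame is 176x144 pixels, i.e. 11x9 = 99 macroblocks of 16x16.
mbs = (176 // 16) * (144 // 16)   # 99 macroblocks per frame
bits_per_mv = 8                   # assumed cost of one coded motion vector
bp_frames_per_sec = 15            # every other frame is a BP frame at 30 fps
backward_kbps = mbs * bits_per_mv * bp_frames_per_sec / 1000.0
print(round(backward_kbps, 1))    # ~11.9 kbps
# Against a 200 kbps forward channel this is ~6%, in line with the 5-8%
# measured for Mode I at low rates; Mode III sends two motion vector
# sets, roughly doubling the backward load.
```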
Practical bandlimited channels generally suffer from various types of degradations
such as bit/erasure errors and synchronization problems. In the following we study
the error resilience performance for the backward channel. We assume that there is no
channel error while sending the parity bits. We test the case when the motion vectors
for the 254th frame in the Foreman sequence are delayed by two frames. Without the
frame index signal, the encoder does not update the motion vector buffer and
continues to use the motion vectors from previous frames. We also test a one-frame motion
vector loss in the 200th frame in the Coastguard sequence. In this scenario, the motion
vectors of the 200th frame are all lost and the encoder still uses the motion vectors
[Figure: PSNR (dB) vs. Data Rate (kbps); curves: H.264 I-B-P-B-P, BCAWZ (Mode III), BCAWZ (Mode I), WZ-INTRA, H.264 INTRA]
Fig. 3.8. BCAWZ: R-D Performance Comparison (Carphone QCIF)
at the buffer from the previous frame without synchronization detection. In the
experiment we observe that when a delay or erasure occurs, the video coding efficiency
without error resilience sharply drops. The quality degradation continues until the
end of the sequence even though the motion vectors of the following BP frames are
correctly received. This is because, as described in Section 3.4, when a delay occurs
the encoder uses a different set of motion vectors, so the reconstructed frame at
the encoder differs from that at the decoder. As this reconstructed frame is used as
a reference for the following frames, the drift propagates across the sequence. If the
motion vectors of more than one frame are lost or delayed, the desynchronization is
even worse due to the mismatch of the reconstructed reference frames. In contrast,
the adaptive coding scheme can detect the desynchronization and insert an INTRA
frame, stopping the drift propagation. As shown in Fig. 3.12 and Fig. 3.13, the
[Figure: PSNR (dB) vs. Data Rate (kbps); curves: H.264 I-B-P-B-P, BCAWZ (Mode III), BCAWZ (Mode I), WZ-INTRA, H.264 INTRA]
Fig. 3.9. BCAWZ: R-D Performance Comparison (Mobile QCIF)

[Figure: (a) 20th frame, WZ with INTRA key frames; (b) 20th frame, BCAWZ]
Fig. 3.10. Comparisons of BCAWZ and WZ with INTRA Key Frames at 511 kbits/second (Foreman CIF)

[Figure: (Backward Channel Bandwidth)/(Forward Channel Bandwidth) vs. Forward Channel Data Rate (kbps); curves: Foreman (Mode I), Foreman (Mode III), Coastguard (Mode I), Coastguard (Mode III)]
Fig. 3.11. Backward Channel Usage in BCAWZ
proposed error resilience scheme can effectively recover from backward channel delays
or erasure errors, incurring only a very small R-D performance penalty compared with
an error-free backward channel.
[Figure: PSNR (dB) vs. Data Rate (kbps); curves: BCAWZ Mode I (error free), BCAWZ with error resilience, BCAWZ without error resilience]
Fig. 3.12. R-D Performance with Error Resilience (Foreman QCIF) (Motion Vector of the 254th Frame is Delayed by Two Frames)
Fig. 3.13. R-D Performance with Error Resilience (Coastguard QCIF) (Motion Vector of the 200th Frame is Lost). PSNR (dB) vs. data rate (kbps) for BCAWZ Mode I (error free), BCAWZ with error resilience, and BCAWZ without error resilience.
4. COMPLEXITY-RATE-DISTORTION ANALYSIS OF
BACKWARD CHANNEL AWARE WYNER-ZIV VIDEO
CODING
4.1 Introduction
Many Wyner-Ziv video coding schemes encode a video sequence into two types of
frames, key frames and Wyner-Ziv frames. Key frames are encoded using conventional
video coding methods such as H.264 and Wyner-Ziv frames are encoded using channel
coding techniques [30,49]. At the decoder, the reconstructed key frames serve as side
information used to reconstruct the Wyner-Ziv frames.
In Chapter 3, we presented a WZVC scheme that uses Backward Channel Aware
Motion Estimation (BCAME) to encode the key frames, where motion estimation
was performed at the decoder and motion information was sent back to the encoder.
We refer to these backward predictively coded frames as BP frames. As shown in
Chapter 3, the coding efficiency of a Wyner-Ziv video codec significantly depends
on the coding efficiency of the key frames. This leads to the need for complexity-
rate-distortion (CRD) optimization for applications where the encoder is subject to
limited computational resources.
Complexity constrained rate-distortion analysis and optimization for video coding
have been of interest in the research community. In [120], a CRD analysis for motion-
compensated prediction (MCP) based video encoding is developed by modeling the
complexity and rate-distortion tradeoff for different encoding parameters. Another
MCP-based CRD analysis is proposed in [121] by considering the complexity of dif-
ferent types of mode selection. In [122], CRD analysis is proposed for a wavelet based
video encoder by modeling the complexity for different spatiotemporal decomposition
structures.
In this chapter we present a model to quantitatively analyze the complexity and
rate-distortion tradeoff for BP frames. Three estimators, the minimum estimator, the
median estimator and the average estimator, are proposed and analyzed.
4.2 Overview of Complexity-Rate-Distortion Analysis in Video Coding
Rate-distortion (R-D) analysis has been studied in the research community for a
long time. Recently the complexity factor has been taken into consideration, as it is
closely related to the power consumption of video coding terminals. In a power con-
strained device, complexity management is very important to maintain durability
and efficiency. Frameworks to measure the complexity have been proposed for different
scenarios [120–124]. 3D models to study the relationship between rate, distortion, and
complexity are presented to optimize the tradeoff among the three factors. In this
section we give an overview of complexity-rate-distortion analysis.
4.2.1 Power-Rate-Distortion Analysis for Wireless Video Communication
In wireless video communication, the power constraint is a challenge for battery-
operated devices. A power constrained rate-distortion analysis is presented in [125]. The
power considered includes video processing power as well as the power consumption in
channel coding and transmission. The joint power control and bit allocation schemes
can achieve significant power saving compared with schemes without optimization.
Another framework to study the power-rate-distortion (P-R-D) relationship is
presented in [120]. Generally wireless appliances operate with limited energy, hence it
is essential to manage the energy consumption. The model evaluates P-R-D behavior
of the conventional video coding system. Guided by the model, the system is able
to automatically configure the control parameters given a power supply level. The
power here refers to video processing power only.
The typical video encoder has three major modules: ME (motion estimation),
PRECODING, and ENC (entropy encoding). The data representation module
PRECODING includes the DCT, IDCT, QUANT (quantization), DQUANT (inverse
quantization), and RECON (reconstruction) modules.
The complexity control parameter set consists of three parameters {x, y, z}. The
complexity of ME is determined by the number of SSD/SAD performed every frame.
The ME control parameter x is defined as:
x = λ_ME / λ_ME^max    (4.1)

where λ_ME denotes the number of sum of absolute difference (SAD) computations per
frame and λ_ME^max is the maximum value of λ_ME. The complexity of PRECODING is
determined by the number of nonzero macroblocks (MBs). The PRECODING control parameter y
is defined as:
y = λ_PRE / M    (4.2)
where λPRE denotes the number of nonzero macroblocks in the frame and M is the
total number of macroblocks in a frame. The complexity of ENC is determined by
the data rates. The other control parameter z is defined as:
z = λ_F / f_max    (4.3)
where λ_F is the encoding frame rate and f_max is the maximum frame rate, with a default value of 30 frames/s. The complexity is transformed to power for measurement.
The final R-D optimized power control problem is formulated as:

min_{x,y,z} D_v(R; x, y, z) = (1 − z)²(β₀ + β₁)
    + 2(2z − z²)(β₀ + β₁e^{−β₂x}) × [½(1 − y)² + y(1 + a₀y) · 2^{−2γR/y}]    (4.4)
s.t. Φ(P) = z(C₁x + C₂y + C₃R)

where D_v(R; x, y, z) denotes the video presentation quality, which is closely related to
the distortion, R denotes the coding data rate in bits per pixel, P denotes the power
consumption, and γ is a model constant related to the distortion. Φ(P) defines the
relationship between the power consumption and the complexity control parameters
with constants C₁, C₂, and C₃. The ME model parameters β₀, β₁, and β₂ are estimated
from the statistics of previous frames. Since the typical operational time of the power
supply is at least several hours, the power control parameters are re-evaluated and
adjusted every few seconds or longer.
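As a concrete illustration, the distortion model (4.4) and the power constraint can be evaluated numerically for candidate control parameters. The sketch below is a minimal, self-contained version; the constants β₀, β₁, β₂, a₀, γ, C₁, C₂, C₃ are illustrative placeholders, not values reported in [120].

```python
import math

def distortion(R, x, y, z, b0=0.5, b1=1.0, b2=3.0, a0=0.1, gamma=1.0):
    """Evaluate the distortion model of (4.4).

    x, y, z are the normalized ME, PRECODING, and ENC control
    parameters in (0, 1]; R is the rate in bits per pixel.  The
    beta/a0/gamma constants here are illustrative placeholders.
    """
    # (1 - z)^2 (b0 + b1): term governed by the ENC parameter z alone.
    term_z = (1 - z) ** 2 * (b0 + b1)
    # Bracketed PRECODING/rate term: 0.5(1-y)^2 + y(1 + a0 y) 2^{-2 gamma R / y}.
    bracket = 0.5 * (1 - y) ** 2 + y * (1 + a0 * y) * 2 ** (-2 * gamma * R / y)
    # 2(2z - z^2)(b0 + b1 e^{-b2 x}) times the bracketed term.
    term_xyz = 2 * (2 * z - z * z) * (b0 + b1 * math.exp(-b2 * x)) * bracket
    return term_z + term_xyz

def power(R, x, y, z, C1=1.0, C2=1.0, C3=1.0):
    """Power constraint Phi(P) = z(C1 x + C2 y + C3 R)."""
    return z * (C1 * x + C2 * y + C3 * R)

# More motion-search effort (larger x) should not increase distortion.
d_low  = distortion(0.5, 0.1, 0.8, 1.0)
d_high = distortion(0.5, 1.0, 0.8, 1.0)
```

A full optimizer would search the (x, y, z) space subject to the power budget; the sketch only shows that the model terms are cheap to evaluate per candidate.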
4.2.2 H.264 Encoder and Decoder Complexity Analysis
There have also been several decoder-based complexity-rate-distortion analyses.
An initial analysis of the computational complexity of the H.26L decoder is pre-
sented in [126]. Interpolation for the sub-pixel positions and the deblocking filter are
reported as the most time consuming processes. Platform-independent optimization
of the two units is also proposed. A comprehensive complexity analysis for the H.264
baseline profile is presented in [123]. The H.264 decoder is shown in Fig. 1.3, which
consists of a series of sub-functions. After entropy decoding, different types of syntax
elements in the bitstream are de-multiplexed. Important header information is extracted
to determine the mode. The residual coefficients are obtained after inverse
quantization and inverse transform. Depending on the mode, a spatial prediction
or temporal prediction is generated from previous reconstructed pixel values. It is
added to the residue to form the reconstructed value for the current block. A table
is formed to study the frequency of occurrence for every sub-function. Another table
is established to compute the numbers of basic operation units (addition, multiplica-
tion, shift) for every sub-function. The complexity of the decoder is measured by the
number of basic operation units. Using this metric, the decoder complexity of H.264
is approximately two to three times that of H.263.
A joint evaluation of performance and complexity for H.264 is presented in [127].
The processing time is used as the metric of complexity. Both the encoder and the
decoder complexity are measured. The memory access frequency (total number of
data transfers from/to memory per second) and the memory usage are also considered.
Eighteen different configurations are set up for the experimental results, where the
search range of motion estimation, block size, rate-distortion optimization method,
and entropy coding method are defined for every configuration. The complexity
is normalized by the simplest configuration, and hence the complexity measurement is
platform-independent. Compared with previous standards (H.263 and MPEG-4), the decoder
complexity of H.264 increases by a factor of two and the encoder increase is larger
than one order of magnitude.
A model to compute the complexity of decoding for a generic system is discussed
in [124]. Two complexity-function variables are defined to model the decoding com-
plexity, the percentage of nonzero transform coefficients and the percentage of decoded
motion vectors out of the maximum number of possible motion vectors. The com-
plexity of entropy decoding and inverse transform is related to the first variable. The
complexity of motion compensation and fractional-pixel interpolation is related to
the second variable. The estimation of computational complexity is quantified with
the number of integer additions and multiplications per pixel. With the model, the
multimedia receiver can adaptively negotiate with the server for a bitstream with a
specific decoding complexity.
4.2.3 Complexity Scalable Motion Compensated Wavelet Video Encoding
There is also research on complexity analysis of wavelet based video coding. A
comprehensive framework for the analysis of wavelet-based video encoding complexity
is presented in [122]. The complexity is measured in terms of the number of motion
estimation (ME) computations. Since motion-compensated temporal filter (MCTF)
wavelet video coding schemes have inherent scalability, a closed-form expression with
different coding options is formulated. The expression shows that the complexity is
related to the layers (depth) of wavelet decomposition. Using the formulation, the
complexity under specific settings can be predicted independent of the video con-
tent. Therefore, it can be used to optimize the tradeoff between rate, distortion
and complexity. The optimized system can achieve a significant reduction of complexity
compared to the full implementation, with an insignificant penalty in rate-distortion
performance.
4.3 Backward Channel Aware Wyner-Ziv Video Coding
As discussed in Chapter 3, we propose a Wyner-Ziv video coding scheme with backward
channel aware motion estimation (BCAME). We show the basic coding diagram in
Fig. 4.1. The basic idea of BCAME is to perform motion estimation at the decoder
and send the motion information back to the encoder through a backward channel.
The motion information is used to encode the key frames at the encoder. The frames
coded in this way are referred to as BP frames. Hence, we can exploit the temporal
correlation of the video sequence while shifting the motion estimation task from the
encoder to the decoder. Wyner-Ziv frames are coded in the transform domain using
channel codes, such as turbo codes or low-density-parity-check (LDPC) codes. Only
part of the parity bits or syndrome bits are sent to the decoder. At the decoder, the
side information is derived from the previously decoded key frames. The derived side
information is used as the initial estimate to decode the Wyner-Ziv frames with the
parity bits received from the encoder. The derivation of side information generally
involves motion estimation and the motion information can be used for BCAME,
hence the additional computational complexity at the decoder is marginal.
A BCAME encoder can generate the motion compensated frames using the motion
vectors received by the backward channel. If several motion vectors are sent to the
encoder, the encoder can make a mode decision based on the distortion and choose the
best available motion vector. For BP frames, the residual frame between the original
frame and the motion compensated reference frame is encoded in a similar way to an
H.264 encoder. Compared to INTRA coded frames, BP frames can significantly
improve the coding efficiency with minimal usage of the backward channel. Compared
Fig. 4.1. Backward Channel Aware Wyner-Ziv Video Coding (block diagram: the encoder applies transform, quantization, and entropy coding to Wyner-Ziv and BP frames; the decoder performs channel decoding, reconstruction, motion estimation, and side information interpolation, returning motion vectors and bit requests over the backward channel)
to the P frames, BP frames move the computationally intensive task of motion
estimation to the decoder, resulting in a reduction of complexity at the encoder.
The computational complexity of a Wyner-Ziv video encoder includes that of the
Wyner-Ziv frames and the BP frames. Since the complexity of the Wyner-Ziv frames at
the encoder is relatively insignificant, we will focus on the analysis of the BP frames.
The computational complexity of BP frames can vary depending upon the number
of candidate motion vectors received at the encoder. When N motion vectors are
received, the encoder needs to compare the distortions resulting from these N
motion vectors. In terms of computational complexity, this is comparable to a motion
search when there are only N candidate motion vectors in the search area, whereas
a traditional P frame requires a motion search over every pixel and sub-pixel position
inside its search window.
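The complexity gap can be made concrete by counting candidate evaluations per block: a full search over a ±S integer-pel window tests (2S + 1)² positions, while a BP frame tests only the N backward-channel candidates. The sketch below is an illustrative count with a toy SAD over flat lists as stand-ins for pixel blocks, not the codec's actual implementation.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def candidate_evaluations(search_range_S=16, n_backward_mvs=2):
    """Number of SAD evaluations per block for each frame type.

    A full search over a +/-S integer-pel window tests (2S + 1)^2
    positions; a BP frame tests only the N candidate motion vectors
    returned by the decoder over the backward channel.
    """
    p_frame = (2 * search_range_S + 1) ** 2
    bp_frame = n_backward_mvs
    return p_frame, bp_frame

p_cost, bp_cost = candidate_evaluations(16, 2)   # 1089 vs 2 evaluations
```

Even before counting sub-pixel refinement (which only widens the gap), the BP frame's motion-related cost is orders of magnitude smaller.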
4.4 Complexity-Rate-Distortion Analysis of BP Frames
4.4.1 Problem Formulation
Assume there are N motion vectors estimated at the decoder. Denote these N
motion vectors as MV1, MV2, · · · , MVN . We assume that the N motion vectors are
2-D independent and identically distributed (i.i.d.) random variables [128] having
the joint probability density function (pdf) f(x, y) and the cumulative probability
distribution function (cdf) F (x, y). Denote the true motion vector as MVT = (xn, yn).
A true motion vector is an ideal motion vector with minimum mean squared error
between the original frame and the reference frame.
Signal power spectrum and Fourier analysis tools are used to analyze the rate
distortion performance in [129]. We extend the idea to apply it to Wyner-Ziv video
coding. As shown in [90, 110], the rate difference between two coders using different
motion vectors is

ΔR_{1,2} = (1 / 8π²) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^{−(1/2)ωᵀω σ²_{Δ1}}) / (1 − e^{−(1/2)ωᵀω σ²_{Δ2}}) ] dω    (4.5)

where σ²_{Δi} (i = 1, 2) are the variances of the error motion vectors. The error motion
vector is the difference between the derived motion vector and the true motion vector.
In the following we analyze three different motion estimators, namely, the minimum
motion estimator, the median motion estimator, and the average motion estimator.
4.4.2 The Minimum Motion Estimator
In this case, all N i.i.d. motion vectors are sent to the encoder. At the encoder,
each of the N motion vectors is used with the reference frame for motion compen-
sation. The motion vector leading to the minimum distortion between the original
frame and motion compensated reference frame is selected. The minimum motion
estimator matches Mode III as described in Section 3.3.1, where N = 2. As in many
traditional fast motion search methods, we assume the motion field is homogeneous
and unimodal; the motion search is then equivalent to choosing the motion vector
nearest to the true motion vector. Hence the corresponding error motion vector is
(X − x_n, Y − y_n), where X and Y are the horizontal and vertical motion vectors. We
use capital letters to denote a random variable unless otherwise specified.
We introduce a new random variable Z to model the distance between the received
motion vector and the true motion vector,
Z = √((X − x_n)² + (Y − y_n)²)    (4.6)
Hence the problem can be formulated as searching for the motion vector with the
smallest Z.
This problem can be solved using order statistics and extreme value theory [130,
131]. More specifically, let X1, X2, · · · , Xn be i.i.d. random variables. Denote
X(1), X(2), · · · , X(n) as the corresponding order statistics, where the first order statis-
tic X(1) is the minimum of X1, X2, · · · , Xn. The probability density function of the
kth order statistic can be formulated as [130]
f_{X(k)}(x) = [n! / ((k − 1)!(n − k)!)] F(x)^{k−1} [1 − F(x)]^{n−k} f(x)    (4.7)
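The order-statistic density (4.7) is easy to sanity-check by simulation. The sketch below is a hypothetical stand-alone check using Uniform(0, 1) samples, for which (4.7) with k = 1 gives f_{X(1)}(x) = n(1 − x)^{n−1} and hence mean 1/(n + 1); it compares the empirical mean of the minimum against that closed form.

```python
import random

def empirical_min_mean(n, trials=100_000, rng=random.Random(0)):
    """Empirical mean of the first order statistic X_(1) of n Uniform(0,1) draws."""
    return sum(min(rng.random() for _ in range(n)) for _ in range(trials)) / trials

# (4.7) with k = 1 and F(x) = x on [0, 1] gives f_{X(1)}(x) = n(1 - x)^(n-1),
# whose mean is 1/(n + 1).
for n in (2, 5, 10):
    assert abs(empirical_min_mean(n) - 1 / (n + 1)) < 5e-3
```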
The distribution of the motion vectors is highly dependent on the motion estimation
method and the content of the sequence. We now consider the case when the received
motion vectors have a 2-D Gaussian distribution, as used in [18, 129],

f_{X,Y}(x, y) = (1 / 2πσ²) e^{−[(x − x′_n)² + (y − y′_n)²] / 2σ²}    (4.8)

where σ² is the motion vector variance, and x′_n and y′_n are the means of the horizontal
and vertical motion vectors, respectively. This assumption matches the studies in
[132, 133], which show that the distribution of the motion vectors has a small variance
around the center area. To facilitate our discussion, we denote the deviations from the
true motion vector as δ_x = x′_n − x_n and δ_y = y′_n − y_n.
Case I: δ_x ≠ 0 or δ_y ≠ 0
In this case the means of the motion vectors sent back from the decoder are different
from the true motion vector. The random variable Z as defined in (4.6) follows a
Rician distribution,

f_Z(z) = (z / σ²) exp(−(z² + ν²) / 2σ²) I₀(νz / σ²),  z ≥ 0    (4.9)

where I₀(t) = (1/2π) ∫_{−π}^{π} e^{t cos θ} dθ is a modified Bessel function and ν² = δ_x² + δ_y².
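A quick Monte Carlo check of (4.9), sketched below under the stated Gaussian assumptions: the second moment of a Rician variable is E[Z²] = ν² + 2σ², so simulated squared distances from an offset 2-D Gaussian motion vector should match that value. The trial count and offsets are illustrative.

```python
import random

def mean_sq_distance(dx, dy, sigma=1.0, trials=100_000, rng=random.Random(3)):
    """Monte Carlo estimate of E[Z^2], where Z is the distance from an
    offset 2-D Gaussian motion vector to the true motion vector."""
    acc = 0.0
    for _ in range(trials):
        acc += rng.gauss(dx, sigma) ** 2 + rng.gauss(dy, sigma) ** 2
    return acc / trials

# Second moment of the Rician variable in (4.9): E[Z^2] = nu^2 + 2 sigma^2.
m = mean_sq_distance(dx=2 ** 0.5 / 2, dy=2 ** 0.5 / 2)   # nu = 1, so E[Z^2] ~ 3
```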
The distribution of the first order statistic of Z, i.e., the smallest Z, can be derived
from (4.7) with k = 1. A plot of the density of Z_(1) for different values of N is shown
in Fig. 4.2. The variance of the error motion vectors with N motion vectors sent back
from the decoder is

σ²_{ΔN} = ∫₀^∞ z² f_{Z(1)}(z) dz = ∫₀^∞ z² N [1 − F_Z(z)]^{N−1} f_Z(z) dz    (4.10)
where f_Z(z) is derived in (4.9) and F_Z(z) is the cumulative probability distribution
function of f_Z(z). From (4.5), the rate difference in using N motion vectors compared
to using only one motion vector is

ΔR_{1,N} = (1 / 8π²) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^{−(1/2)ωᵀω σ²_{ΔN}}) / (1 − e^{−(1/2)ωᵀω σ²_{Δ1}}) ] dω    (4.11)
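The double integral in (4.5)/(4.11) has no simple closed form, but it can be evaluated numerically. The sketch below uses a plain midpoint rule on [−π, π]²; the grid size and the variance values are illustrative choices, not those used to generate the figures.

```python
import math

def rate_difference(var_N, var_1, grid=64):
    """Numerically evaluate (4.11) with a midpoint rule on [-pi, pi]^2.

    var_N and var_1 are the error-motion-vector variances when N motion
    vectors and a single motion vector are used, respectively.
    """
    h = 2 * math.pi / grid
    total = 0.0
    for i in range(grid):
        wx = -math.pi + (i + 0.5) * h
        for j in range(grid):
            wy = -math.pi + (j + 0.5) * h
            ww = wx * wx + wy * wy          # omega^T omega
            num = 1 - math.exp(-0.5 * ww * var_N)
            den = 1 - math.exp(-0.5 * ww * var_1)
            total += math.log2(num / den) * h * h
    return total / (8 * math.pi ** 2)

# Reducing the error variance (more motion vectors) yields a negative
# rate difference, i.e. a rate saving relative to a single vector.
dr = rate_difference(var_N=2 / 5, var_1=2.0)   # Rayleigh case, N = 5, sigma^2 = 1
```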
Fig. 4.2. The Probability Density Function of Z_(1) (ν = 1, σ² = 1; N = 1, 2, 3, 4, 10)
The numerical rate differences with ν = 1 and ν = 1/2 are shown in Fig. 4.3 and 4.4
respectively for various σ². Fig. 4.3 shows that sending more motion vectors through
the backward channel can improve the coding efficiency of BP frames. For example,
in the case of ν = 1, σ² = 1, compared to only sending one motion vector to the
encoder, sending five motion vectors can lead to a rate saving of 0.2276 bit/sample,
or an improvement of roughly 0.2276 × 6.02 = 1.3702 dB in peak signal-to-noise ratio
(PSNR). It is noted that this rate saving is more significant when σ² is smaller. When
σ² is large (such as σ² = 10), sending more motion vectors to the encoder has very
limited impact on rate savings. In other words, when there is fast or large irregular
motion in a video sequence, sending more motion vectors is not justified. We also note
that when ν is smaller, i.e., when the motion vectors extracted at the decoder from
Fig. 4.3. Rate Difference of the Minimum Motion Estimator (δ_x = δ_y = √2/2, ν = 1; σ² = 1, 2, 5, 10)
the previous reconstructed frames are closer to the true motion vector on average,
the rate saving is more significant.
Case II: δx = 0 and δy = 0
In this case, the mean of the received motion vector is identical to the true motion
vector. This is a special case of (4.9) with ν = 0, and the distribution of Z as defined
in (4.6) is a Rayleigh distribution with

f_Z(z) = (z / σ²) e^{−z² / 2σ²},  z ≥ 0    (4.12)

F_Z(z) = 1 − e^{−z² / 2σ²},  z ≥ 0    (4.13)
Fig. 4.4. Rate Difference of the Minimum Motion Estimator (δ_x = δ_y = √2/4, ν = 1/2; σ² = 1, 2, 5, 10)
Fig. 4.5. Rate Difference of the Minimum Motion Estimator (δ_x = δ_y = 0, ν = 0; σ² = 1, 2, 5, 10)
The first order statistic and the variance of the error motion vectors are

f_{Z(1)}(z) = N [1 − F_Z(z)]^{N−1} f_Z(z) = N (z / σ²) (e^{−z²/2σ²})^N

σ²_{ΔN} = E[Z_(1)²] = ∫₀^∞ z² f_{Z(1)}(z) dz = ∫₀^∞ z² N (z / σ²) (e^{−z²/2σ²})^N dz = (2/N) σ²    (4.14)
The rate difference with respect to the single motion vector is

ΔR_{1,N} = (1 / 8π²) ∫_{−π}^{π} ∫_{−π}^{π} log₂ [ (1 − e^{−(1/2)ωᵀω (2σ²/N)}) / (1 − e^{−(1/2)ωᵀω (2σ²)}) ] dω    (4.15)
The results are shown in Fig. 4.5. Compared to Fig. 4.3 and 4.4, the rate saving is
higher than in Case I, where ν ≠ 0.
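The closed form (4.14) is easy to verify by simulation. The following sketch, an illustrative check rather than part of the codec, draws N zero-mean Gaussian motion vectors per trial and averages the smallest squared distance to the true motion vector:

```python
import random

def min_sq_distance_mean(N, sigma=1.0, trials=100_000, rng=random.Random(1)):
    """Monte Carlo estimate of E[Z_(1)^2], the smallest squared distance to
    the true motion vector among N zero-mean 2-D Gaussian motion vectors."""
    acc = 0.0
    for _ in range(trials):
        acc += min(rng.gauss(0.0, sigma) ** 2 + rng.gauss(0.0, sigma) ** 2
                   for _ in range(N))
    return acc / trials

# (4.14) predicts E[Z_(1)^2] = 2 sigma^2 / N in the Rayleigh (nu = 0) case.
for N in (1, 2, 5):
    assert abs(min_sq_distance_mean(N) - 2.0 / N) < 0.05
```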
In terms of computational complexity, the motion search complexity of the minimum
motion estimator is linear in the number N of motion vectors sent back from the
decoder. Since the encoder complexity in this case depends largely on the motion
search complexity, (4.11) and (4.15) describe not only the rate-distortion tradeoff
but also the complexity-rate-distortion tradeoff for the minimum estimator.
Furthermore, we note that the backward channel bandwidth usage can also be assumed
to be linear in N, hence (4.11) and (4.15) also describe the tradeoff between the
rate-distortion performance and the backward channel bandwidth usage.
4.4.3 The Median Motion Estimator
The minimum motion estimator in Section 4.4.2 requires that N motion vectors
be sent through the backward channel; the encoder then needs to perform a motion
search to find the best candidate motion vector. This leads to higher requirements
on encoding complexity and backward channel bandwidth. When the application
cannot satisfy these requirements, a motion vector needs to be selected at the decoder
among the N motion vector candidates, and only this motion vector is sent to the
encoder.
One way to choose such a motion vector is to use the median motion vector among
the N motion vectors. The median motion vector is constructed using the median of
the horizontal motion vectors and the median of the vertical motion vectors. When
N is odd, i.e., N = 2p + 1, we choose the m = (p + 1)th order statistic. When N is
even (N = 2p), the pth order statistic is chosen. Considering the Gaussian case in
(4.8), the horizontal and vertical motion vectors are independent with

f_X(x) = (1 / √(2π)σ) e^{−(x − x′_n)² / 2σ²}    (4.16)

f_Y(y) = (1 / √(2π)σ) e^{−(y − y′_n)² / 2σ²}    (4.17)
Since X and Y follow the same distributions, we will discuss X only and the perfor-
mance of Y can be analyzed similarly.
The cumulative probability distribution function of the Gaussian distribution in
(4.16) can be expressed as
F_X(x) = ∫_{−∞}^{x} (1 / √(2π)σ) e^{−(t − x′_n)² / 2σ²} dt = 1 − (1/2) erfc((x − x′_n) / √2 σ)    (4.18)
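Equation (4.18) expresses the Gaussian CDF through the complementary error function, which is available directly in most numerical libraries; a minimal sketch using Python's `math.erfc`:

```python
import math

def gaussian_cdf(x, mean=0.0, sigma=1.0):
    """F_X(x) = 1 - 0.5 * erfc((x - mean) / (sqrt(2) * sigma)), as in (4.18)."""
    return 1.0 - 0.5 * math.erfc((x - mean) / (math.sqrt(2.0) * sigma))

# Sanity checks against well-known Gaussian quantiles.
assert abs(gaussian_cdf(0.0) - 0.5) < 1e-12
assert abs(gaussian_cdf(1.96) - 0.975) < 1e-3
```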
Table 4.1
The variance of the error motion vectors for the minimum and median motion estimators (ν = 0, σ² = 1)

N         1       3       5       7       9
minimum   2.000   0.667   0.400   0.286   0.222
median    2.000   0.897   0.574   0.421   0.332
where the error function erfc(x) is defined as erfc(x) = (2/√π) ∫_x^∞ e^{−t²} dt. The
probability density function of the median order statistic is

f_{X(m)}(x) = [N! / ((m − 1)!(N − m)!)] F(x)^{m−1} [1 − F(x)]^{N−m} f(x)
          = [N! / ((m − 1)!(N − m)!)] [1 − (1/2) erfc((x − x′_n)/√2 σ)]^{m−1}
            × [(1/2) erfc((x − x′_n)/√2 σ)]^{N−m} (1/√(2π)σ) e^{−(x − x′_n)²/2σ²}    (4.19)
Since we are interested in the distance to the true motion vector x_n, we define a new
variable X_d = X − x_n; the distribution of X_d and the variance of the error motion
vector E[X_d²], according to (4.7), are

f_{X_d}(x) = [N! / ((m − 1)!(N − m)!)] [1 − (1/2) erfc((x − δ_x)/√2 σ)]^{m−1}
             × [(1/2) erfc((x − δ_x)/√2 σ)]^{N−m} (1/√(2π)σ) e^{−(x − δ_x)²/2σ²}    (4.20)

E[X_d²] = ∫_{−∞}^{∞} x² f_{X_d}(x) dx    (4.21)
Since X and Y are independent, the total variance of the error motion vector is
σ²_Δ = E[X_d²] + E[Y_d²]. Table 4.1 gives a summary of σ²_{ΔN} for the case ν = 0,
σ² = 1. The median motion estimator yields a larger variance of the error motion
vectors than the minimum estimator.
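Table 4.1 can be reproduced approximately by simulation. The sketch below is illustrative only (ν = 0, σ² = 1): it estimates both variances side by side, taking the per-component median exactly as described above.

```python
import random
import statistics

def estimator_variances(N, sigma=1.0, trials=50_000, rng=random.Random(2)):
    """Monte Carlo variances of the error motion vector (nu = 0) for the
    minimum and median estimators over N Gaussian motion vectors."""
    sum_min = sum_med = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(N)]
        ys = [rng.gauss(0.0, sigma) for _ in range(N)]
        # Minimum estimator: smallest squared distance to the true vector.
        sum_min += min(x * x + y * y for x, y in zip(xs, ys))
        # Median estimator: component-wise median, then squared distance.
        sum_med += statistics.median(xs) ** 2 + statistics.median(ys) ** 2
    return sum_min / trials, sum_med / trials

v_min, v_med = estimator_variances(5)   # Table 4.1 row N = 5: 0.400 vs 0.574
```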
The rate difference with respect to the single motion vector is shown in Figs. 4.6-4.8.
Comparing the results of the minimum motion estimator and the median motion
Fig. 4.6. Rate Difference of the Median Motion Estimator (δ_x = δ_y = √2/2, ν = 1; σ² = 1, 2, 5, 10)
Fig. 4.7. Rate Difference of the Median Motion Estimator (δ_x = δ_y = √2/4, ν = 1/2; σ² = 1, 2, 5, 10)
Fig. 4.8. Rate Difference of the Median Motion Estimator (δ_x = δ_y = 0, ν = 0; σ² = 1, 2, 5, 10)
estimator, the rate saving of the median motion estimator is smaller than that of the
minimum motion estimator. For example, when ν = 1, σ² = 1, and N = 5, the
minimum estimator achieves a rate saving of 0.2276 bit/sample, while the median
estimator achieves a rate saving of only 0.0570 bit/sample. It is interesting to note
that in Fig. 4.6, when N is large (N ≥ 12), the rate saving for σ² = 2 is actually
greater than for σ² = 1. This is because when the motion vector error is comparable
to the motion vector variance, the rate saving continues to improve as N increases,
but the improvement grows at a slower pace. Fig. 4.8 shows the rate difference for
the case δ_x = δ_y = 0. In this case, the rate saving is generally more significant than
in the case ν > 0, and increasing N noticeably improves the coding efficiency; the use
of multiple motion vectors in the median motion estimator thus provides a more
significant rate saving.
4.4.4 The Average Motion Estimator
Another way to choose the motion vector candidates at the decoder is to send the
average of the motion vectors, which is referred to as the average motion estimator.
With the N motion vectors available at the decoder, we send the average motion
vectors X̄ = (1/N)(X₁ + X₂ + ··· + X_N) and Ȳ = (1/N)(Y₁ + Y₂ + ··· + Y_N) to the
encoder. The sample average of a Gaussian distribution is also Gaussian [134], so

X̄ ∼ N(x′_n, σ²/N),  Ȳ ∼ N(y′_n, σ²/N)    (4.22)
The variance of the error motion vector is

σ²_{ΔN} = E[(X̄ − x_n)²] + E[(Ȳ − y_n)²] = (δ_x² + σ²/N) + (δ_y² + σ²/N) = ν² + 2σ²/N    (4.23)
The resulting variances are shown in Table 4.2, with comparisons to the minimum and
median motion estimators (ν = 1, σ² = 1). The variance of the error motion vectors
using the average motion estimator is higher than that of the minimum motion
estimator. Compared with the median motion estimator, the variance using the average motion
Table 4.2
Comparisons of the variance of the error motion vectors for the minimum, median and average motion estimators

          ν = 1, σ² = 1                ν = 1/2, σ² = 1/2
N     Minimum  Median  Average    Minimum  Median  Average
1     3.0000   3.0000  3.0000     1.2500   1.2500  1.2500
3     1.0519   1.8973  1.6667     0.4233   0.6987  0.5833
5     0.6406   1.5737  1.4000     0.2550   0.5368  0.4500
7     0.4609   1.4209  1.2857     0.1825   0.4604  0.3929
9     0.3600   1.3322  1.2222     0.1421   0.4161  0.3611
estimator is lower. The rate difference with respect to the single motion vector using
the average motion estimator is shown in Figs. 4.9-4.11. With the same requirements
on the encoder complexity and the bandwidth of the backward channel, the average
motion estimator gives better rate-distortion performance than the median motion
estimator. Similar to Fig. 4.6, and for the same reason mentioned in Section 4.4.3,
we can observe a similar crossover effect in Fig. 4.9 between the curves of σ² = 1
and σ² = 2 when N ≥ 8. In the case ν = 0, (4.23) reduces to σ²_{ΔN} = 2σ²/N, which
is identical to (4.14). In other words, when ν = 0, the average motion estimator
performs as well as the minimum motion estimator. When ν ≠ 0, the minimum
motion estimator can achieve higher coding efficiency at the cost of higher encoding
complexity and more backward channel bandwidth.
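Since (4.23) is in closed form, the average-estimator column of Table 4.2 can be checked directly; a small sketch:

```python
def average_estimator_variance(N, nu_sq, sigma_sq):
    """Error-motion-vector variance of the average estimator, (4.23):
    sigma_Delta^2 = nu^2 + 2 sigma^2 / N."""
    return nu_sq + 2.0 * sigma_sq / N

# Table 4.2, average column (nu = 1, sigma^2 = 1): N = 3 -> 1.6667, N = 5 -> 1.4000.
assert abs(average_estimator_variance(3, 1.0, 1.0) - 1.6667) < 1e-3
assert abs(average_estimator_variance(5, 1.0, 1.0) - 1.4) < 1e-12
# nu = 0 reduces to 2 sigma^2 / N, matching the minimum estimator in (4.14).
assert average_estimator_variance(4, 0.0, 1.0) == 0.5
```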
4.4.5 Comparisons of Minimum, Median and Average Motion Estimators
Fig. 4.12 shows two examples of the comparisons of the three motion estimators. The
green curve denotes the minimum motion estimator, the red curve denotes the median
motion estimator, and the black curve denotes the average motion estimator. In Fig.
4.12-(b), ν is small and σ² is large, and there are no significant differences between the
Fig. 4.9. Rate Difference of the Average Motion Estimator (δ_x = δ_y = √2/2, ν = 1; σ² = 1, 2, 5, 10)
Fig. 4.10. Rate Difference of the Average Motion Estimator (δ_x = δ_y = √2/4, ν = 1/2; σ² = 1, 2, 5, 10)
Fig. 4.11. Rate Difference of the Average Motion Estimator (δ_x = δ_y = 0, ν = 0; σ² = 1, 2, 5, 10)
three estimators. In Fig. 4.12-(a), the minimum motion estimator performs much
better than the other two motion estimators. The rate-distortion performance of the
minimum motion estimator is the best among the three motion estimators, and
typically the average motion estimator performs better than the median estimator.
However, the minimum motion estimator has higher requirements on the encoder
complexity and the backward channel bandwidth.
Fig. 4.12. Comparisons of the Minimum, Median and Average Motion Estimators: (a) δ_x = δ_y = √2/2, ν = 1, σ² = 1; (b) δ_x = δ_y = √2/4, ν = 1/2, σ² = 10
5. CONCLUSIONS
The advance of video compression and wireless communication systems has made it
possible to capture and transmit video signals almost any time and anywhere. A
new approach to video coding, known as Wyner-Ziv video coding, was proposed in
recent years. The main characteristic of Wyner-Ziv coding is that side information
is available only at the decoder. Sensor networks and mobile video are some of
the applications where Wyner-Ziv coding may be very useful. These applications are
characterized by the need for reduced encoding complexity due to the limited
availability of resources (e.g., battery life and memory). In this thesis, we studied
new approaches for low complexity video coding. The complexity-rate-distortion
tradeoff of the new approach is extensively analyzed.
5.1 Contributions of The Thesis
The main contributions of the thesis are:
• Wyner-Ziv Video Codec Architecture Testbed
We studied the Wyner-Ziv video codec architecture and built a Wyner-Ziv video
coding testbed. The video sequences are divided into two different parts using
different coding schemes. Part of the sequence is coded by a conventional INTRA
coding scheme and referred to as key frames. The other part of the sequence is
coded as Wyner-Ziv frames using channel coding methods. Both turbo codes
and low-density-parity-check (LDPC) codes are supported in the system. The
sequences can be coded in either the pixel domain or the transform domain. Only
part of the parity bits from the channel coders is transmitted to the decoder.
Hence, the decoding of the Wyner-Ziv frames needs side information at
the decoder. The side information can be extracted from the key frames by
extrapolation or interpolation. We study various methods for side information
generation. We also analyze the rate-distortion performance of Wyner-Ziv video
coding compared with conventional INTRA or INTER coding.
• Backward Channel Aware Wyner-Ziv Video Coding
In Wyner-Ziv video coding, many key frames have to be INTRA coded to keep
the complexity low at the encoder and to provide side information for the Wyner-
Ziv frames. However, the use of INTRA frames limits the coding performance.
We proposed a new Wyner-Ziv video coding method that uses backward channel
aware motion estimation to code the key frames. The main idea is to perform
motion estimation at the decoder and send the estimated motion vectors back to
the encoder. In this way, we keep the complexity low at the encoder with minimal
use of the backward channel. Our experimental results show that the scheme can
significantly improve the coding efficiency, by 1-2 dB compared with Wyner-Ziv
coding with INTRA-coded key frames. We also proposed the use of multiple motion
decision modes, which further improves the coding efficiency by 0.5-2 dB. When
the backward channel is subject to erasure errors or delays, the coding efficiency
of our method decreases. We provided an error resilience technique to handle this
situation; it avoids the quality degradation caused by channel errors and incurs
only a small rate-distortion penalty compared with an error-free channel.
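The decoder-side motion search at the heart of this scheme can be sketched as follows. This is an illustrative full-search SAD block matcher in Python; the block size, search range, and function name are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def decoder_side_motion_estimation(block, ref, top, left, search_range=4):
    # Exhaustive SAD block matching performed at the DECODER against a
    # previously decoded reference frame. Only the winning motion vector
    # (a few bits per block) is sent back to the encoder over the
    # backward channel, keeping encoder complexity low.
    h, w = block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate falls outside the reference frame
            cand = ref[y:y + h, x:x + w].astype(np.int32)
            sad = np.abs(block.astype(np.int32) - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv
```

The encoder then uses the returned vector for motion-compensated prediction of the key frame, so the expensive search never runs on the resource-limited device.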
• Complexity-Rate-Distortion Analysis of Backward Channel Aware
Wyner-Ziv Video Coding
Beyond the rate-distortion performance of backward channel aware Wyner-Ziv
video coding, we presented a model to study its complexity-rate-distortion (CRD)
coding efficiency. We presented three motion estimators: the minimum motion
estimator, the median motion estimator, and the average motion estimator. Suppose
several candidate motion vectors are derived at the decoder. The minimum motion
estimator sends all of them to the encoder, and the encoder chooses the motion
vector with the smallest distortion. When the backward channel bandwidth cannot
meet this requirement, or when encoder complexity is a concern, the median or
average motion estimator instead selects a single motion vector at the decoder.
The results show that the rate-distortion performance of the average estimator is
generally higher than that of the median estimator. If the rate-distortion tradeoff
is the only concern, the minimum estimator yields better results than the other
two estimators. However, for applications with complexity constraints, our analysis
shows that the average estimator can be a better choice. Our proposed model
quantitatively describes the complexity-rate-distortion tradeoff among these
estimators.
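The three estimators can be sketched as follows. This is illustrative Python: candidate motion vectors are (dy, dx) tuples, and the distortion function stands in for the encoder-side measurement; the function names are ours, not the thesis's.

```python
import numpy as np

def minimum_estimator(candidates, distortion):
    # All candidate MVs are sent back to the encoder, which picks the
    # one with the smallest distortion (best RD, highest backward-channel
    # rate and encoder complexity).
    return min(candidates, key=distortion)

def median_estimator(candidates):
    # Decoder sends a single MV: the component-wise median.
    arr = np.array(candidates)
    return tuple(np.median(arr, axis=0).astype(int))

def average_estimator(candidates):
    # Decoder sends a single MV: the component-wise (rounded) average.
    arr = np.array(candidates)
    return tuple(np.round(arr.mean(axis=0)).astype(int))
```

The median and average estimators keep the backward channel to one vector per block and move the decision entirely to the decoder; the minimum estimator trades extra backward-channel rate and encoder work for the best vector.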
5.2 Future Work
In this thesis, we proposed the use of backward channel aware motion estimation in
Wyner-Ziv video coding. In the future, we could study various ways to improve the
accuracy of the motion vectors. For example, we could refine a received motion vector
by searching near its neighbors. When making the mode decision at the encoder, we
could consider the rate factor to make a rate-distortion optimized decision. We could
also insert INTRA blocks into BP frames where the correlation between neighboring
frames is weak.
We also presented a model for studying the complexity-rate-distortion coding
efficiency of backward channel aware Wyner-Ziv video coding. For future work, we
will investigate the use of our model to examine the impact of INTRA-coded frames
and INTER-coded frames. The CRD performance of INTRA frames, BP frames, and
INTER frames can then be compared. Based on the CRD model, we could optimize
the coding structure to minimize the Lagrange cost of rate and distortion subject
to a complexity constraint.
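Such a complexity-constrained optimization could take the following form. This is a hypothetical sketch: the mode names echo the thesis (INTRA, BP, INTER), but the numbers are placeholders, not measurements.

```python
def select_coding_mode(modes, lam, complexity_budget):
    # Among modes (name, rate, distortion, complexity) that fit the
    # complexity budget, minimize the Lagrange cost J = D + lam * R.
    feasible = [m for m in modes if m[3] <= complexity_budget]
    if not feasible:
        raise ValueError("no coding mode satisfies the complexity constraint")
    return min(feasible, key=lambda m: m[2] + lam * m[1])

# Placeholder (name, rate, distortion, complexity) tuples.
MODES = [
    ("INTRA", 120.0, 40.0, 1.0),   # cheap encoder, rate-hungry
    ("BP",     80.0, 42.0, 2.0),   # backward-predicted key frame
    ("INTER",  60.0, 38.0, 10.0),  # best RD, highest encoder complexity
]
```

With a tight complexity budget the INTER mode is excluded and the BP mode wins on Lagrange cost; with an unconstrained budget INTER wins, which mirrors the tradeoff the CRD model is meant to quantify.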
Backward channel aware Wyner-Ziv video coding is well suited to video
surveillance applications, where a low complexity encoder is needed and a feedback
channel is available between the cameras and the base station. We plan to study
the requirements of a practical video surveillance system and to build and refine
the backward channel aware Wyner-Ziv video coding structure accordingly.
VITA
Limin Liu was born in Fujian, P. R. China. She earned her B.E. degree in electrical
engineering in 2001 from Tsinghua University, Beijing, China.
From Fall 2001 to Fall 2002, she studied in Klipsch School of Electrical and Com-
puter Engineering at New Mexico State University. Since Spring 2003, she has been
pursuing her Ph.D. degree in the School of Electrical and Computer Engineering at
Purdue University, West Lafayette, Indiana. She has been a research assistant in the
Video and Image Processing Laboratory (VIPER) under the supervision of Professor
Edward J. Delp. Her research work has been funded by a grant from the Indiana
Twenty-First Century Research and Technology Fund. Her current research interests
include image and video compression (JPEG/JPEG2000/MPEG-2/H.263/MPEG-
4/H.264/AVC), image and video post processing and multimedia communications.
In the summer of 2005, she worked as a summer intern at Thomson Corporate
Research, Indianapolis, Indiana. She worked on the design and development of Ad-
vanced 4:4:4 Profile, which was adopted by the Joint Video Team (JVT) and became
a new profile in H.264. In the summer of 2006, she worked at Sun Microsystems Lab-
oratories, Menlo Park, California, as a summer intern. In the summer of 2007, she
worked as an intern at IBM Thomas J. Watson Research Center, Yorktown Heights,
NY, on Wyner-Ziv coding, with particular interest in the mode decision in Wyner-Ziv
video coding.
Limin Liu is a student member of the IEEE professional society and Women in
Engineering. Limin Liu is a member of the Eta Kappa Nu (HKN) national electrical
engineering honor society.
Limin Liu’s publications for her research work include:
Book Chapters:
1. Limin Liu, Fengqing Zhu, Marc Bosch, and Edward J. Delp, “Recent Advances
in Video Compression: What’s Next? ” book chapter in Statistical Science and
Interdisciplinary Research: Pattern Recognition, Image Processing and Video
Processing, Eds. Bhabatosh Chanda, et al, World Scientific Press, 2007
Journal papers:
1. Limin Liu, Zhen Li, and Edward J. Delp, “Video Surveillance Using Backward
Channel Aware Wyner-Ziv Video Coding,” submitted to IEEE Transactions on
Circuits and Systems for Video Technology.
2. Zhen Li, Limin Liu, and Edward J. Delp, “Rate Distortion Analysis of Motion
Side Estimation in Wyner-Ziv Video Coding,” IEEE Transactions on Image
Processing, Vol. 16, No. 1, January 2007, pp. 98-113.
3. Zhen Li, Limin Liu, and Edward J. Delp, “Wyner-Ziv Video Coding with
Universal Prediction,” IEEE Transactions on Circuits and Systems for Video
Technology, Vol. 16, No. 11, November 2006, pp. 1430-1436.
Conference Papers:
1. Limin Liu, Yuxin Liu, and Edward J. Delp, “Enhanced Intra Prediction Using
Context-Adaptive Linear Prediction,” accepted by the Picture Coding Symposium,
November 2007, Lisbon, Portugal.
2. Limin Liu, Zhen Li, and Edward J. Delp, “Complexity-Rate-Distortion Analysis
of Backward Channel Aware Wyner-Ziv Video Coding,” Proceedings of the
IEEE International Conference on Image Processing, September 16-19, 2007,
San Antonio, TX.
3. Limin Liu, Fengqing Zhu, Marc Bosch, and Edward J. Delp, “Recent Advances
in Video Compression: What’s Next?” Proceedings of the IEEE International
Symposium on Signal Processing and its Applications, February 12-15, 2007,
Sharjah, United Arab Emirates (Plenary Paper).
4. Limin Liu, Yuxin Liu, and Edward J. Delp, “Content-adaptive Motion Estimation
for Efficient Video Compression,” Proceedings of the SPIE International
Conference on Video Communications and Image Processing, January 28 -
February 1, 2007, San Jose, CA.
5. Limin Liu, Zhen Li, and Edward J. Delp, “Complexity-constrained Rate-distortion
Optimization for Wyner-Ziv Video Coding,” Proceedings of the SPIE International
Conference on Multimedia on Mobile Devices, January 28 - February 1,
2007, San Jose, CA.
6. Limin Liu, Zhen Li, and Edward J. Delp, “Backward Channel Aware Wyner-Ziv
Video Coding,” Proceedings of the IEEE International Conference on Image
Processing, October 8-11, 2006, Atlanta, GA, pp. 1677-1680.
7. Limin Liu and Edward J. Delp, “Wyner-Ziv Video Coding Using LDPC Codes,”
Proceedings of the IEEE 7th Nordic Signal Processing Symposium, June 7-9,
2006, Reykjavik, Iceland.
8. Limin Liu, Pau Sabria, Luis Torres, and Edward J. Delp, “Error Resilience
in Network-Driven Wyner-Ziv Video Coding,” Proceedings of the SPIE International
Conference on Video Communications and Image Processing, January
15-19, 2006, San Jose, CA.
9. Zhen Li, Limin Liu, and Edward J. Delp, “Wyner-Ziv Video Coding with Uni-
versal Prediction,” Proceedings of the SPIE International Conference on Video
Communications and Image Processing, January 15-19, 2006, San Jose, CA.
10. Zhen Li, Limin Liu, and Edward J. Delp, “Wyner-Ziv Video Coding: A Motion
Estimation Perspective,” Proceedings of the SPIE International Conference on
Video Communications and Image Processing, January 15-19, 2006, San Jose,
CA.
11. Limin Liu, Yuxin Liu, and Edward J. Delp, “Network-driven Wyner-Ziv Video
Coding Using Forward Prediction,” Proceedings of the SPIE International
Conference on Image and Video Communications and Processing, January 16-20,
2005, San Jose, CA, Vol. 5685, pp. 641-651.