




Signal Processing: Image Communication 26 (2011) 220–235


Low delay distributed video coding with refined side information

António Tomé, Fernando Pereira*,#

Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal

Article info

Article history:

Received 6 May 2010

Accepted 3 January 2011

Available online 1 March 2011

Keywords:

Distributed video coding

Side information

Low delay

Correlation noise modeling

Iterative refinement

0923-5965/$ - see front matter © 2011 Elsevier B.V. All rights reserved.

doi:10.1016/j.image.2011.01.005

* Corresponding author. E-mail address: [email protected] (F. Pereira).

# EURASIP member.

Abstract

Distributed video coding (DVC) is a new video coding paradigm based upon two fundamental theoretical results: the Slepian–Wolf and Wyner–Ziv theorems. Among other benefits, this new coding paradigm may allow a flexible complexity allocation between the encoder and the decoder. Several DVC codecs have been developed over the years addressing the specific requirements of emerging applications such as wireless video surveillance and sensor networks. While state-of-the-art DVC codecs, such as the DISCOVER DVC codec, have shown promising RD performance, most DVC codecs in the literature do not consider low delay requirements, which are relevant for some of the addressed applications. In this context, the main objective and novelty of this paper is to propose an efficient, low delay and fully practical DVC codec based on the Stanford DVC architecture, adopting a side information iterative refinement approach. The obtained performance results show that the developed DVC solution fulfils the objectives regarding relevant benchmarks, notably due to the novel side information creation and correlation noise modeling tools integrated in a side information iterative refinement framework.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Video coding technologies have been playing an important role in the context of audiovisual services such as digital TV, mobile video, and Internet streaming to cope with the ever growing compression requirements. Most available video coding standards, notably the ITU-T H.26x and ISO/IEC MPEG-x families of standards, adopt the so-called predictive video coding paradigm, where the temporal and spatial correlations are exploited at the encoder by using a motion compensated prediction loop and a spatial transform, respectively. As a consequence, this video coding paradigm typically leads to rather complex encoders and much simpler decoders, with a rigid allocation of the complexity between the transmitter and the receiver. This approach fits well some application scenarios, e.g. broadcasting, where a few (complex) encoders provide coded content for millions of (simpler) decoders. However, with the explosion of mobile and wireless networks, there is a growing number of applications where many senders deliver data to a central receiver, e.g. video surveillance and sensor networks. Typically, these emerging applications require light encoding complexity, high compression efficiency, robustness to packet losses and, often, also low latency/delay. To address these emerging requirements, some research groups revisited the video coding problem in the light of an Information Theory result from the 1970s: the Slepian–Wolf theorem [1]. According to this theorem, the minimum rate needed to independently encode two statistically dependent discrete random sequences, X and Y, is the same as for joint encoding. While the Slepian–Wolf theorem deals with lossless coding, in 1976, Wyner and Ziv studied the case of lossy coding with side information (SI) at the decoder. Under some hypotheses on the joint statistics, the Wyner–Ziv theorem [2] states that, when the side information (i.e. the correlated source Y) is made available only at the decoder, there is no coding efficiency loss in encoding X with respect to the case when joint encoding of X and Y is performed. In summary, the Slepian–Wolf and Wyner–Ziv theorems state that it is possible to encode two statistically dependent signals independently and


decode them jointly, while approaching the coding efficiency of conventional predictive coding schemes, which rely on joint encoding and decoding instead. The new coding paradigm, known as distributed video coding (DVC), does not rely on joint encoding and thus, when applied to video coding, it typically results in the absence of the temporal prediction loop (always used in predictive schemes) and lower complexity encoders. DVC architectures may provide the following functional benefits, which are rather important for many emerging applications: (i) flexible allocation of the global video codec complexity; (ii) improved error resilience; (iii) codec independent scalability (since upper layers do not have to rely on precise lower layers); and (iv) exploitation of multiview correlation without cameras/encoders communicating among them. The functional benefits above can be relevant for a large range of emerging application scenarios such as wireless video cameras, low-power surveillance, video conferencing with mobile devices, disposable video cameras, visual sensor networks, distributed video streaming, multiview video systems, and wireless capsule endoscopy [3]. For a review on DVC basics and advances, please read [4–6].
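For reference, the rate bounds behind the two theorems recalled above can be stated compactly (these are the standard statements from information theory, not results specific to this paper):

```latex
% Slepian–Wolf admissible rate region for lossless distributed coding of X and Y:
R_X \geq H(X \mid Y), \qquad R_Y \geq H(Y \mid X), \qquad R_X + R_Y \geq H(X, Y)

% Wyner–Ziv (lossy coding of X with side information Y at the decoder only):
% R^{\mathrm{WZ}}_{X \mid Y}(D) \geq R_{X \mid Y}(D),
% with equality, e.g., for jointly Gaussian sources under MSE distortion.
```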

Based on these theoretical results, the practical design of Wyner–Ziv (WZ) video codecs, a particular case of DVC, started around 2002, following important developments in channel coding technology. The first practical WZ solutions were developed at Stanford University [4,7,8] and UC Berkeley [9,10]. As of today, the most popular WZ video codec design in the literature is clearly the Stanford architecture, which works at the frame level and is characterized by a feedback channel based decoder rate control. On the other hand, the Berkeley architecture, known as Power-efficient, Robust, hIgh compression Syndrome based Multimedia coding (PRISM), works at the block level and is characterized by an encoder side rate control approach based on the availability of a reference frame at the decoder. Regarding the Stanford DVC solutions, the side information generation process strongly impacts not only the overall RD performance but also the algorithmic delay, depending on whether an interpolation or extrapolation-based side information creation solution is used. In the context of DVC codecs, side information refers to the estimation made at the decoder, based on the already available decoded frames, for the frame to be DVC coded. In most available DVC solutions, the side information creation process is performed using an interpolation-based approach, since it corresponds to estimating a frame between two available frames, one in the past and another in the future; since one of the reference frames for the interpolation is in the future, this solution implies algorithmic delay, like when using B frames in predictive video coding, which may not be acceptable for some applications. For the applications requiring low delay, it is possible to adopt an extrapolation-based side information creation solution where the estimation of the side information is made by projecting decoded frames from the past to the future, without requiring the availability of future frames, and thus avoiding the algorithmic delay.
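As a minimal illustration of the delay difference just described, consider which decoded frames each SI creation approach relies on. This is a hypothetical helper of our own; the reference-frame choices shown are the typical ones (two neighbors for interpolation, the two most recent past frames for extrapolation), not a prescription from this paper:

```python
def si_reference_frames(n, mode):
    """Indices of the decoded frames used to estimate side information
    for WZ frame n (simplified sketch; real codecs may use more frames)."""
    if mode == "interpolation":
        return [n - 1, n + 1]   # needs a future frame -> algorithmic delay
    if mode == "extrapolation":
        return [n - 2, n - 1]   # past frames only -> no algorithmic delay
    raise ValueError(mode)

# A frame is decodable with zero algorithmic delay iff no reference is in the future.
assert all(r < 4 for r in si_reference_frames(4, "extrapolation"))
assert any(r > 4 for r in si_reference_frames(4, "interpolation"))
```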

While there are in the literature many examples of practical and realistic interpolation-based SI DVC codecs, which have an associated delay penalty, the same does not happen for extrapolation-based low delay DVC codecs, which are rare and typically adopt impractical assumptions, e.g. the availability of originals at the decoder to generate the side information or to drive the request stopping criterion.

In this context, the objective of this paper is to develop a novel DVC solution based on the Stanford architecture, fulfilling two main requirements: high efficiency and low delay. To achieve this objective with a practical, realistic DVC architecture, this paper proposes efficient, extrapolation-based side information generation and adaptive correlation noise modeling solutions integrated in an iterative refinement approach. As far as the authors know, this architectural design has never been proposed and assessed in the literature. As will be shown, the associated RD performance is promising, both in comparison with standard-based solutions as well as state-of-the-art DVC interpolation-based solutions, notably for video content related to the most relevant application scenarios.

The rest of the paper is organized as follows. Section 2 provides a classification system for DVC solutions, and also reviews the most relevant low delay DVC codecs in the literature. Section 3 gives detailed information on the first version of the codec developed in this paper, the Advanced Low Delay DVC (ALD-DVC) codec. In order to achieve a better RD performance, a side information refinement module is next integrated into the ALD-DVC codec, as described in Section 4. Section 5 presents the performance evaluation for the two proposed low delay DVC codecs. Finally, Section 6 includes the conclusions and future work.

2. Classification and background

The DVC spectrum of solutions is wide and, thus, some organization of this landscape is welcome for a better understanding of the similarities, complementarities and alternatives regarding possible DVC solutions. This requires the definition of some relevant classification dimensions, which do not have to be unique, but may certainly help in understanding the relations between the various types of possible DVC solutions. Fig. 1 shows the DVC classification tree proposed in this paper, which is based on four structuring dimensions; the presentation of the tree is simplified in order to facilitate the reading process, i.e. the classification tree under the Multiview side is exactly the same as under the Monoview side, thus only the Monoview side is shown in full detail. The proposed DVC classification dimensions are:

1. Number of camera views: Depending on the number of views to code, DVC solutions may be classified as Monoview or Multiview.

2. Delay: Regarding the coding delay, DVC solutions may fulfill or not low (algorithmic) delay requirements.

3. Basic spatial coding support: Regarding the type of basic spatial coding support, DVC solutions may typically be frame or block-based.

4. Feedback channel: Regarding the exploitation of a feedback channel when available, which is possible for real-time applications, DVC solutions may do it or not.

Fig. 1. Proposed classification tree for DVC solutions.

The proposed classification tree allows structuring the panoply of DVC solutions, meaning that each DVC solution should fit in a leaf of the designed tree. This paper will propose, in the next sections, technical advances in terms of a specific leaf of the proposed classification tree, notably monoview, low delay, frame-based and feedback channel enabled DVC solutions.

As described in the previous section, the first DVC solutions, developed around 2002, were the Stanford DVC and PRISM DVC codecs, which led to a number of developments, resulting in many improved DVC solutions. Most of these enhanced DVC codecs have been based on the Stanford DVC basic architecture, namely the state-of-the-art interpolation-based SI DISCOVER DVC codec [11], which uses many powerful tools to achieve better RD performance results. The DISCOVER DVC codec is a frame-based, transform domain codec which online estimates the statistics of the correlation noise between the WZ frame and the corresponding SI, making it a practical codec without unrealistic assumptions. However, as it uses interpolation-based tools to perform the motion estimation at the decoder, delay is an issue, since the interpolation process requires future frames to be available, thus adding unwelcome delay.

Although some low delay Stanford based DVC solutions have been developed over the years, the extrapolation-based SI techniques used do not allow them to reach the same level of RD performance obtained with interpolation-based SI creation processes. Among the most relevant low delay Stanford-based DVC solutions is the Low Delay IST DVC codec developed by Natario et al. [12], which proposes a rather elaborate low delay side information generation process. In [13], Weerakkody et al. propose an evolution of the Low Delay IST DVC codec, performing the refinement of the side information frame, combined with two other tools presented in [14,15], namely the creation and use of two side information streams and a sequential motion estimation process based on the luminance and chrominance components. Although the performance results may look promising, some of the test conditions are unrealistic, unfair or unclear. For example, the Low Delay IST DVC codec still uses original frames at the decoder for the side information creation process, and results are only presented for the initial 100 frames of the Foreman sequence, accounting only for the rate spent in the WZ frames. The test conditions in [13] follow the same unrealistic pattern. In order to boost the RD performance of codecs addressing low delay requirements, some extrapolation-based techniques have been developed by Borchert et al. in [16,17]. These techniques use three past frames to create a three-frame motion vector field; however, the details regarding how to generate this motion vector field are not given in the papers. Other algorithms have been proposed, such as the 3-D Recursive Search [18] and the Content Adaptive Resolution [19], but they suffer from the same problem.

In the context above, the target of this paper is to propose a novel, efficient, monoview, low delay, frame-based and feedback channel enabled DVC solution without impractical assumptions, and to assess its performance under precise and relevant test conditions and in comparison with the appropriate benchmarks.

3. Advanced low delay IST DVC codec

The Advanced Low Delay DVC codec (ALD-DVC) proposed in this paper is a Stanford based DVC solution, thus following the early Stanford DVC architecture mentioned before, as do the DISCOVER DVC and Low Delay IST DVC codecs [11,12]. The proposed ALD-DVC solution eliminates all assumptions made in available low delay DVC codecs considered to be unrealistic, e.g. the use of original frames at the decoder, thus creating a fully practical low delay DVC solution. The architecture, detailed in Fig. 2, shows the various blocks constituting the ALD-DVC codec. This section will concentrate on the technical novelties regarding the available DVC codecs, namely the Side Information Creation and the Correlation Noise Modeling modules, as they are the major

Fig. 2. ALD-DVC architecture.


technical advances regarding previous low delay DVC codecs, contributing to an efficient, practical, and low delay driven DVC codec, i.e. fulfilling the objectives of this paper. Thus, the other modules in the architecture will not be described, as they follow the well known behavior of Stanford based DVC solutions. For further information regarding all these processes, please read [5,11].

A common remark about DVC solutions including a feedback channel is that the usage of the feedback channel is a weakness by itself. Although the usage of a feedback channel is not realistic for applications that are intrinsically unidirectional, and thus impossible there, its usage for bidirectional applications is clearly possible and, in fact, wise if it allows to provide better overall codecs, e.g. more efficient ones. This reasoning is clearly more relevant in the context of DVC, since the decoder plays there a more central role, e.g. in terms of motion estimation, and thus it is rather natural that the same happens also for rate control, whenever possible. Not using the feedback channel when it is available and can bring benefits does not look wise, very much like not using the characteristics of the human visual system when it is known that the decoded video frames are to be seen by a human with certain well known characteristics. Naturally, the usage of the feedback channel in some DVC architectures implies the rate control is performed by the decoder and, thus, it is only possible for real-time applications, e.g. video surveillance, wireless video cameras, and video conferencing with mobile devices.

In summary, the proposed ALD-DVC codec works as follows:

At the encoder:

1. Splitter or frame classification: First, a video sequence is divided into WZ frames, this means the frames that will be coded using a WZ approach (as described in the next bullets), and key frames that will be coded as Intra frames, e.g. using the H.264/AVC Intra codec; key frames are typically periodically inserted with a certain GOP size.

2. Discrete cosine transform: Over each WZ frame, an integer 4×4 block-based DCT is applied. The DCT coefficients of the entire WZ frame are then grouped together, according to the position occupied by each DCT coefficient within the 4×4 blocks, forming the DCT coefficients bands.

3. Quantization: After the transform coding operation, each DCT coefficients band is uniformly quantized. Over the resulting quantized symbol stream (associated to a DCT coefficients band), bitplane extraction is performed. For a given band, the quantized symbols bits of the same significance (e.g. the most significant bit) are grouped together, forming the corresponding bitplane array, which is then independently turbo encoded.

4. Turbo encoding: The turbo encoding procedure for each DCT coefficients band starts with the most significant bitplane array, which corresponds to the most significant bits of the quantized symbols for each band. The parity information generated by the turbo encoder for each bitplane is then stored in a buffer and sent in chunks/packets upon decoder request, through the feedback channel.
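Steps 2–3 above (coefficient band grouping, uniform quantization and bitplane extraction) can be sketched as follows. This is a simplified sketch under our own naming: it ignores coefficient signs and the different treatment real WZ codecs give to DC and AC bands:

```python
def dct_bands(frame_blocks):
    """Group the 16 coefficients of each 4x4 DCT block by position:
    band b collects coefficient b from every block of the frame."""
    return [[blk[b] for blk in frame_blocks] for b in range(16)]

def bitplanes(band, step, n_bits):
    """Uniformly quantize a coefficient band with the given step and split
    the quantized symbols into bitplane arrays, most significant first.
    Magnitudes only, for simplicity (signs ignored)."""
    q = [min(abs(c) // step, (1 << n_bits) - 1) for c in band]
    return [[(s >> b) & 1 for s in q] for b in range(n_bits - 1, -1, -1)]

# Each bitplane array would then be independently turbo encoded.
```

For example, `bitplanes([5, 17, 30], step=8, n_bits=2)` quantizes the band to the symbols 0, 2, 3 and returns the MSB plane `[0, 1, 1]` followed by the LSB plane `[0, 0, 1]`.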

At the decoder:

1. Side information creation: The decoder creates the side information for each WZ coded frame with an extrapolation-based approach, as proposed in Section 3.1. The side information for each WZ frame corresponds to an estimation of the original WZ frame; the better the quality of this estimation, the smaller the number of 'errors' the WZ turbo decoder has to correct and the bitrate necessary for successful decoding (i.e. with a small error probability).

2. DCT estimation: A block-based 4×4 DCT is then carried out over the side information in order to obtain the DCT coefficients, which are an estimate of the WZ frame DCT coefficients.

3. Correlation noise modeling: The residual statistics between corresponding WZ frame DCT coefficients and the side information DCT coefficients are assumed to be modeled by a Laplacian distribution. The correlation noise modeling is performed as proposed in Section 3.2.

4. Turbo decoding: Once the DCT transformed side information and the residual statistics for a given DCT coefficients band are known, the decoded quantized symbol stream associated to the DCT band can be obtained through the turbo decoding procedure. The turbo decoder receives from the encoder successive chunks of parity bits following the requests made through the feedback channel.

5. Request stopping criterion: To decide whether or not more bits are needed for the successful decoding of a certain bitplane, the decoder uses a simple request stopping criterion, i.e. it checks that all turbo code parity-check equations are fulfilled for the decoded (hard decision) codeword. If no more bits are needed to decode the bitplane, the decoding of the next bitplane or band can start; otherwise, the bitplane turbo decoding task has to proceed with another request and receive another chunk of parity bits for the same bitplane.

6. Further turbo decoding: After successfully turbo decoding the most significant bitplane array of a DCT band, the turbo decoder proceeds in an analogous way with the remaining bitplanes associated to that band. Once all the bitplane arrays of a DCT coefficients band are successfully turbo decoded, the turbo decoder starts decoding the next band. This procedure is repeated until all the DCT coefficients bands for which WZ bits are transmitted are turbo decoded.

7. CRC checking: Because some residual errors are left even after the stopping criterion is fulfilled, and may have a rather negative subjective impact on the decoded frame quality, a CRC check sum is transmitted to help the decoder detect and correct the remaining errors in each bitplane. A CRC-8 check sum for each bitplane was found to be strong enough for this purpose, while only adding minimal extra rate (8 bits per bitplane).

8. Symbol assembling: After turbo decoding all the bitplanes associated to a DCT band, the bitplanes are grouped together to form the decoded quantized symbol stream associated to that band. This procedure is performed over all the DCT coefficients bands for which WZ bits are transmitted. The DCT coefficients bands for which no WZ bits were transmitted are replaced by the corresponding DCT bands from the DCT side information.

9. Reconstruction: Once all quantized symbol streams are obtained, it is possible to reconstruct the matrix of decoded DCT coefficients for each block.

10. IDCT: Afterwards, a block-based 4×4 IDCT is performed and the reconstructed pixel domain WZ frame is obtained.

11. Frame remixing: Finally, to get the decoded video sequence, decoded key frames and WZ frames are mixed conveniently.

Fig. 3. Side information creation module architecture.
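The Laplacian model assumed in the correlation noise modeling step can be illustrated with a minimal sketch. Note the caveat: at the decoder the true WZ/SI residual is not available, so practical codecs estimate it online (as this paper proposes in Section 3.2, and as DISCOVER [11] does); the fit below, using the standard variance relation of a zero-mean Laplacian, sigma^2 = 2/alpha^2, is only a generic illustration, not the paper's estimator:

```python
import math

def laplacian_alpha(residuals):
    """Fit the Laplacian scale parameter alpha from a residual sample,
    using the variance relation sigma^2 = 2 / alpha^2."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / n
    return math.sqrt(2.0 / var)

def laplacian_pdf(x, alpha):
    """f(x) = (alpha / 2) * exp(-alpha * |x|): the assumed correlation
    noise distribution between WZ and SI DCT coefficients."""
    return 0.5 * alpha * math.exp(-alpha * abs(x))
```

The fitted alpha is what the turbo decoder needs to turn the side information coefficients into soft inputs (bit probabilities) for each bitplane.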

The next sections will describe the details of the extrapolation-based side information creation and correlation noise modeling modules, which are the most novel in the proposed ALD-DVC solution.

3.1. Extrapolation-based side information creation

This section describes in detail the extrapolation-based side information creation process adopted in the proposed ALD-DVC codec. This process considers the same sub-modules as the Low Delay IST DVC codec [12]. However, as the description of the algorithms is rather short in [12], the authors of this paper had to redesign most of the tools; moreover, there are also some completely novel ideas, notably for the treatment of the holes and overlapped pixels in the projected frame. The Side Information Creation module can be broken down into four sub-modules, namely motion estimation, motion field smoothing, motion projection and, finally, overlapping and uncovered areas treatment (see Fig. 3).

(A) Motion estimation: The first step in the side information creation process is the motion estimation sub-module, whose objective is to create a motion vector field to be used to project/extrapolate the side information for the next WZ frame to be decoded. Given a current frame n, this module uses the two previously decoded frames, n−1 and n−2 (either key frames or WZ frames), to build a motion vector field. The algorithm used in the creation of this motion vector field, commonly known as Full Search Block Matching (FSBM), is presented below:

1. Block candidate search: Taking frame n−1, and using a specific block size, e.g. 8×8 samples, a search is performed in frame n−2 for each block in frame n−1 at position (x,y), referring to the central position of each block, trying to find the best suitable match around that position (x,y). As there is no gain in performing the search in the entire frame n−2, a window surrounding position (x,y) is used to limit the complexity of the search; typically, the search window has a size of 16 samples in both directions. This block candidate search is performed for every block in frame n−1.

2. Matching error computation: To find the best match in the previous frame n−2, this search process uses an error metric known as the weighted mean absolute matching difference (WMAD). The WMAD is computed based on the absolute difference between the block in frame n−1 and the block being analyzed in frame n−2, as in Eq. (1), weighted by a factor depending on the distance between the two blocks being analyzed; in this way, the WMAD provides a better metric than the simple mean absolute difference (MAD):

WMAD = ( Σ_{dx=0}^{7} Σ_{dy=0}^{7} |X_{n−1}(x, y) − X_{n−2}(x + dx, y + dy)| / N ) × (1 + K·sqrt(dx² + dy²))    (1)

where X_{n−1} and X_{n−2} represent frames n−1 and n−2, N is the number of pixels in the block of frame n−1 whose motion is being estimated with the WMAD criterion (8×8 in this case), dx and dy refer to the displacement between the block in frame n−1 and the blocks in frame n−2 being compared, and K is a smoothness constant controlling the penalty introduced when the motion vectors go to extreme positions in the search range; experimental results suggest that K = 0.05 provides a good trade-off between efficiency and quality smoothness. By performing the motion search with this WMAD criterion for all the blocks in frame n−1, a motion vector field is created.

(B) Motion field smoothing: The second sub-module in the Side Information Creation module targets the smoothing of the motion vector field created in the previous step, with the purpose of adding some robustness to the motion estimation process, as follows:

1. Neighbor blocks definition: To address the objective above, for each specific block and its associated motion vector in frame n−1, a better and improved motion vector is obtained by also using the motion vectors from the neighbor blocks.

2. Motion vector field smoothing: Having defined the neighbor blocks to be used, a new motion vector is computed as the median value, for both components (x,y), of all the motion vectors of the available neighboring blocks and also the current block. This median value becomes the new motion vector for that specific block. Using the mean of the available motion vectors instead of the median would provide a less reliable solution, as the average may be strongly conditioned by a single 'very bad' value, which does not happen with the median, which 'filters' the 'outliers'.
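The component-wise median smoothing just described can be sketched as below. Since the paper does not fully specify the neighbor set here, a 3×3 neighborhood (clipped at the field borders) is assumed, and the high median is taken for even-sized neighborhoods:

```python
def smooth_motion_field(mv, rows, cols):
    """Median-filter each motion vector over its 3x3 neighborhood,
    component-wise, as a robustness step against spurious matches.
    `mv` is a rows x cols grid of (dx, dy) tuples."""
    def median(vals):
        s = sorted(vals)
        return s[len(s) // 2]   # high median for even-sized lists
    out = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neigh = [mv[a][b]
                     for a in range(max(0, i - 1), min(rows, i + 2))
                     for b in range(max(0, j - 1), min(cols, j + 2))]
            out[i][j] = (median([v[0] for v in neigh]),
                         median([v[1] for v in neigh]))
    return out
```

A single outlier vector in a locally uniform field is replaced by the dominant neighborhood motion, which is exactly the behavior the mean cannot guarantee.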

(C) Motion projection: To create the extrapolated frame n, this means the side information for the next WZ frame to be decoded, a motion projection is performed for each block in frame n−1. By applying the motion vector field resulting from the second sub-module to every 8×8 block in frame n−1, an extrapolated frame n is obtained. This process may lead to two types of problems addressed by the next sub-module: overlapping areas, and holes corresponding to uncovered areas in the projected frame.

(D)

Overlapping and uncovered areas treatment: Thefourth sub-module addresses the problems resultingfrom the previous sub-module where a frame projec-tion was performed with the previously computedmotion field: overlapping and uncovered areas(holes) in the projected frame.(D.1) Overlapping areas: The overlapping areas corre-

spond to the areas in the projected frame whichare covered by more than one sample projectedfrom the previous frame n�1; this means thereis more than one estimation value for the sameframe position and, thus, some unique solutionmust be determined. The solution adopted hasbeen to average the alternative, competingvalues for the overlapping samples, and usethose values as the final estimated value; as theestimation value must be an integer, the averageabove is integer truncated.

(D.2) Uncovered areas: The uncovered areas correspond to the areas in the projected frame which are not covered by any sample projected from the previous frame n-1; this means there is no estimation value for those frame positions. The solution adopted here to fill those uncovered areas has been to average (and integer truncate) the values of the surrounding projected samples.

1. Uncovered areas detection: The scanning algorithm to find the uncovered areas within each frame proceeds from top to bottom and from left to right. Special situations include the presence of a hole at the first sample (top-left) of the frame, which is solved by copying the sample in the same position from the previous frame.

2. Uncovered areas filling: The detected uncovered areas are filled using the average value of the surrounding eight samples, excluding those which are also in uncovered areas. Again, as the estimated value has to be an integer, integer truncation of the average is performed.
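The motion projection and the overlapping/uncovered areas treatment of sub-modules (C) and (D) can be sketched as follows; this is a simplified NumPy illustration (the (dy, dx) vector convention and the previous-frame fallback for holes with no projected neighbor are simplifying assumptions of ours, not the authors' exact code):

```python
import numpy as np

BLOCK = 8  # 8x8 blocks, as in the extrapolation module described above

def project_frame(prev_frame, mv_field):
    """Project frame n-1 into frame n using the smoothed motion field.

    Overlapping samples are resolved by integer-truncated averaging of
    the competing projected values; uncovered samples (holes) are filled
    with the truncated average of the surrounding projected samples,
    with a fallback to the co-located sample of the previous frame.
    """
    h, w = prev_frame.shape
    acc = np.zeros((h, w), dtype=np.int64)   # sum of projected values
    cnt = np.zeros((h, w), dtype=np.int64)   # how many values landed here
    for by in range(h // BLOCK):
        for bx in range(w // BLOCK):
            dy, dx = mv_field[by, bx]
            y0, x0 = by * BLOCK, bx * BLOCK
            ty, tx = y0 + dy, x0 + dx        # projected (target) position
            if 0 <= ty <= h - BLOCK and 0 <= tx <= w - BLOCK:
                blk = prev_frame[y0:y0 + BLOCK, x0:x0 + BLOCK]
                acc[ty:ty + BLOCK, tx:tx + BLOCK] += blk
                cnt[ty:ty + BLOCK, tx:tx + BLOCK] += 1
    out = np.zeros((h, w), dtype=np.int64)
    covered = cnt > 0
    out[covered] = acc[covered] // cnt[covered]  # truncated overlap average
    # Fill holes, scanning top-to-bottom, left-to-right.
    for y in range(h):
        for x in range(w):
            if covered[y, x]:
                continue
            if y == 0 and x == 0:
                out[0, 0] = prev_frame[0, 0]  # special top-left case
                continue
            ys, xs = slice(max(y - 1, 0), y + 2), slice(max(x - 1, 0), x + 2)
            vals = out[ys, xs][covered[ys, xs]]
            out[y, x] = int(vals.mean()) if vals.size else prev_frame[y, x]
    return out
```

With a zero motion field every block projects onto itself, so the extrapolated frame reproduces frame n-1 exactly; overlaps and holes only appear for non-trivial motion.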

Fig. 4 column (a) represents the side information frame without using the proposed motion vector field smoothing process; next, column (b) shows, for the same frame, the improvements achieved with the usage of the motion field smoothing process; finally, column (c) demonstrates the application of the last sub-module, the overlapping and uncovered areas treatment, where the holes are represented by the black zones, which are eliminated with the proposed algorithms.

3.2. Extrapolation-based correlation noise modeling

The correlation noise modeling (CNM) is a very important module in the DVC architecture, with the main objective of providing the turbo decoder with a measure of confidence on the generated side information, and also of helping in the reconstruction process. In this context, the correlation noise refers to the difference between each original (quantized) WZ frame available at the encoder and the corresponding side information created at the decoder. To design a practical DVC solution, online correlation noise modeling has to be performed, corresponding to the process where the CNM parameters are estimated at the decoder in a practical way, thus without using any original data never realistically available at the decoder. As usual, a Laplacian distribution is used to model the distribution of the motion-compensated residual DCT coefficients [20]; this distribution is typically characterized by a single parameter, which has to be estimated, the so-called α parameter, since the residue average is assumed to be zero. Inspired by the CNM solution in [20], developed for interpolation-based side information, the novel CNM solution proposed here for extrapolation-based side information may be divided into the six sub-modules presented in Fig. 5. In this paper, the frame residue will be computed using the frames involved in the extrapolation process, i.e. the frames used to define the motion field and to project the side information for the current WZ frame. As the ALD-DVC codec works in the transform domain, the correlation noise is nothing more than the residual between the DCT bands of the quantized WZ frame and the corresponding SI frame.

Fig. 4. (a) SI after motion projection without motion vector field smoothing; (b) SI after motion projection with motion vector field smoothing; and (c) SI after motion projection with motion vector field smoothing and treatment of overlapping and uncovered areas.

A. Tome, F. Pereira / Signal Processing: Image Communication 26 (2011) 220–235

Fig. 5. Correlation noise modeling module architecture.

(A) Residual frame computation: The first step in the CNM process regards the computation of the residual frame, which should estimate the difference between the original (quantized) WZ frame available at the encoder and the corresponding side information created at the decoder. There are three possible alternatives to compute the residue, depending on the situation, as the projected blocks in frame n clearly cannot have the same treatment as the overlapping and uncovered areas. Thus, this residual computation can be divided into four steps, the first being the identification of each type of situation, i.e. whether it is a projected area, an overlapping area or an uncovered area. The remaining three steps are the residue computation for the projected areas, the residue computation for the overlapping areas and, finally, the residue computation for the uncovered areas.

1. Identification of projected areas versus overlapping/uncovered areas: There are areas in the current extrapolated frame n which were not obtained through the motion projection step, i.e. the values of some samples were obtained using the overlapping and uncovered areas treatment sub-module presented before. In this context, it is necessary to identify the areas which were obtained through motion projection in order to estimate their residue. All the other areas have a different residue calculation process, detailed below.

2. Residue computation for projected areas: The residue estimation for the projected areas is performed as the subtraction between the blocks in the previous frames n-1 and n-2 that project into the current block in frame n, assuming linear motion, as shown in Eq. (2). A low residue means here that the motion is well modeled and, thus, there is high confidence on the generated side information. In this context, the residue is estimated as

R(x,y) = X_{n-1}(x - dx_1, y - dy_1) - X_{n-2}(x - dx_2, y - dy_2)    (2)

where X_{n-1} and X_{n-2} represent the two previously decoded frames used in the side information creation process and (dx_1, dy_1) represents the motion vector that created the block in frame n, i.e. the motion vector after the motion field smoothing. The motion vectors (dx_2, dy_2) have twice the size of the motion vectors (dx_1, dy_1), since frame n-2 is linearly projected to a time instant two frame periods away. The residual image computed here has two problems inherent to the side information creation process: the holes and the overlapping areas. For these areas, the residue computation proceeds as detailed in the following.

3. Residue computation for overlapping areas: If some of the motion projected blocks overlap when creating the SI frame then, according to the residue computation for projected areas step presented before, there is one residue value for each possible (overlapping) sample value available for that position. In this situation, it is proposed here to average and truncate the various computed residues.

4. Residue computation for uncovered areas: The holes are filled with the truncated average of the surrounding residue values, excluding those that have no value associated. Special situations include the presence of a hole at the first sample of the frame, where no residue was computed and, thus, no averaging using neighbors is possible; in this case, a zero value is attributed to that residue.
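As an illustration, the residue of Eq. (2) for a single motion-projected block might be computed as in the following sketch (function and argument names are hypothetical):

```python
import numpy as np

def projected_block_residue(x_nm1, x_nm2, mv, block_xy, block=8):
    """Residue estimate for a motion-projected block, per Eq. (2).

    Assuming linear motion, the residue at each sample (x, y) of the
    projected block is X_{n-1}(x-dx1, y-dy1) - X_{n-2}(x-dx2, y-dy2),
    where (dx2, dy2) = 2*(dx1, dy1) since frame n-2 is two frame
    periods away. A low residue indicates well-modeled motion and thus
    high confidence in the generated side information.
    """
    dx1, dy1 = mv
    dx2, dy2 = 2 * dx1, 2 * dy1
    x0, y0 = block_xy                       # top-left of the block in frame n
    res = np.empty((block, block), dtype=np.int64)
    for y in range(block):
        for x in range(block):
            xx, yy = x0 + x, y0 + y
            res[y, x] = (int(x_nm1[yy - dy1, xx - dx1])
                         - int(x_nm2[yy - dy2, xx - dx2]))
    return res
```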

(B) |R_DCT| frame generation: The residue frame R obtained in the previous step has to be brought to the transform domain, as the information received by the turbo decoder regards the DCT bands of the encoded WZ frame:

1. R_DCT computation: A 4×4 block-based integer DCT transform is applied to the residual frame R(x,y) computed above in order to obtain the DCT coefficients frame R_DCT(u,v).

2. |R_DCT| frame generation: Afterwards, the absolute value of the frame R_DCT(u,v) is computed, resulting in a |R_DCT(u,v)| frame.

(C) |R_DCT| band b variance computation: The objective of this variance computation is to provide a reference value when classifying the DCT coefficients resulting from the Side Information Creation module as more or less reliable; this will also be used later in the computation of the α parameter (see Eq. (6)). Using Eq. (3), the R_DCT band b variance σ_b^2 is computed:

\sigma_b^2 = E_b[(|R_{DCT}|_b)^2] - (E_b[|R_{DCT}|_b])^2    (3)

where E_b represents the expected value, i.e. the mean. If the CNM is performed at band level (for the whole frame), the α parameter is estimated using Eq. (4), requiring only the value of the variance computed in this step. If a more granular, and thus more accurate, correlation noise modeling is desired, e.g. at the


coefficient level, then the α parameter has to be estimated at a lower level, as proposed below.

\alpha_b = \sqrt{2 / \sigma_b^2}    (4)

(D) |R_DCT|(u,v) DCT coefficient distance computation: To distinguish more and less reliable DCT coefficient estimations coming from the side information creation process, a distance D_b(u,v), computed as the difference between the |R_DCT(u,v)| coefficient for a given band b and the |R_DCT| frame band b average μ_b, is required:

D_b(u,v) = |R_{DCT}(u,v)|_b - \mu_b    (5)

(E) α parameter estimation at DCT coefficient (u,v) level: This final step performs the estimation of the α parameter for the DCT coefficient at position (u,v) using

\alpha_b(u,v) =
\begin{cases}
\alpha_b, & [D_b(u,v)]^2 \le \sigma_b^2 \\
\sqrt{2 / [D_b(u,v)]^2}, & [D_b(u,v)]^2 > \sigma_b^2
\end{cases}    (6)

To maximize the estimation accuracy, the solution in Eq. (6) distinguishes two situations, as motivated in detail in [20], depending on the estimated accuracy of the side information.
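Putting Eqs. (3)-(6) together, the coefficient-level α estimation may be sketched as follows; an illustrative NumPy implementation assuming the |R_DCT| values are already grouped as 4×4 blocks (the small epsilon guarding against zero variance is our own addition):

```python
import numpy as np

def alpha_per_coefficient(abs_rdct):
    """Estimate the Laplacian alpha parameter per DCT coefficient.

    abs_rdct: array (H/4, W/4, 4, 4) holding |R_DCT(u,v)| for every 4x4
    block. For each band b=(u,v): the band variance follows Eq. (3),
    sigma2_b = E[|R|^2] - E[|R|]^2; the band-level alpha follows Eq. (4),
    alpha_b = sqrt(2/sigma2_b); the distance follows Eq. (5),
    D_b = |R_DCT| - mu_b; and the coefficient-level alpha follows
    Eq. (6): alpha_b where D^2 <= sigma2_b, sqrt(2/D^2) otherwise.
    """
    mu = abs_rdct.mean(axis=(0, 1))                     # E_b[|R|], per band
    sigma2 = (abs_rdct ** 2).mean(axis=(0, 1)) - mu**2  # Eq. (3)
    alpha_band = np.sqrt(2.0 / np.maximum(sigma2, 1e-12))  # Eq. (4)
    dist2 = (abs_rdct - mu) ** 2                        # Eq. (5), squared
    alpha = np.where(dist2 <= sigma2,
                     alpha_band,                        # reliable coefficient
                     np.sqrt(2.0 / np.maximum(dist2, 1e-12)))  # Eq. (6)
    return alpha
```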

4. Advanced low delay IST DVC codec with side information refinement

As mentioned in previous sections, the main objective of this paper is to propose an efficient and low delay practical DVC codec. Naturally, to further boost the RD performance of the ALD-DVC codec, new ideas must be integrated; thus, a novel approach proposed by Martins et al. [21] for interpolation-based DVC architectures is considered here for an extrapolation-based architecture, acting at the level of the side information creation process. As shown in Fig. 6, the ALD-DVC encoder architecture stays exactly the same, as opposed to the decoder architecture, which includes a novel module designed based on the side information refinement (SIR) process

Fig. 6. ALD-DVC SIR architecture.

proposed in [21]. The novel SIR module results from a simple but rather powerful idea: by successively refining the side information along the decoding process, e.g. after each DCT band is decoded, it is possible to use successively better side information; since the quality of the side information is critical for the RD performance of this type of DVC codec, this SIR approach should contribute in the right direction, this means to improve the overall RD performance. Using better SI for the decoding of the next DCT bands will imply a reduced WZ rate requested from the encoder to reach the same final quality and, thus, better compression efficiency. Moreover, the decoder complexity associated to the turbo decoding will also be significantly reduced, as fewer parity bit chunks will have to be decoded.

As in Section 3, where the various ALD-DVC modules were described, only the technical novelties will be detailed in this section. This means that only two decoder modules deserve special attention, notably the SIR and the Reconstruction modules. For more information on the other processes, please read [5,11].

4.1. Extrapolation-based side information refinement

The SIR module can be broken down into three main processing steps, notably the block selection for refinement, the candidate block searching and the new side information creation. Contrary to [21], where an interpolation-based context is adopted, an extrapolation-based side information refinement approach is adopted here. As far as the authors know, this is the first time an iterative side information refinement approach is integrated in a low-delay DVC architecture.

(A) Initial DCT domain side information creation: This step is not strictly part of the SIR algorithm, as it corresponds to the side information creation process as already performed in the previously presented ALD-DVC architecture.

1. Initial side information frame creation: The first step of this entire algorithm is associated to the creation of the first side information frame (for each


frame) using the extrapolation-based technique presented in Section 3.1.

2. Pixel to transform domain: As soon as the initial SI frame is generated in the pixel domain, a 4×4 integer DCT transform is applied to obtain the corresponding DCT coefficients, which will serve as SI for the decoding of the first DCT band by the turbo decoder. This first DCT band is very important as it represents the DC band of the frame, this means band 0. All the other DCT bands needed for the reconstruction process come from the initial side information extrapolated frame since, after decoding the first band, only this band has been WZ 'corrected'. Note that the SI blocks used in the decoding process either come from the initial side information process or from the side information refinement process, as represented in Fig. 6. From here, the steps presented below are successively performed after the decoding of each DCT band, with the exception of the last band, for a total of 15 (4×4−1) bands.

(B) Block selection for refinement: This sub-module has the objective to determine, after decoding each DCT band, the blocks in the SI frame which are good candidates for SI refinement. With this target, the following steps are performed:

1. Block reconstruction: At any given time, the block reconstruction process is the same, notably after decoding a given band b-1. Using the DCT bands already decoded, and copying the bands above or equal to band b (not yet decoded) from the (DCT) initial side information, it is possible to reconstruct the current frame. This reconstruction begins after all the bitplanes for a given DCT band are decoded, reconstructing the frame and performing an inverse DCT transform to go from the transform to the pixel domain.

2. Current error computation: As soon as the reconstructed current frame is available, it is necessary to assess the error between its blocks and the corresponding ones in the initial side information. The current decoded frame is an improved version of the initial side information frame, since it has been built using some already decoded, and thus corrected, DCT bands and the remaining initial side information DCT bands. Thus, by checking the error between this current frame and the SI frame blocks, using Eq. (7), it is possible to determine, for each block, how good the original SI frame is regarding the already decoded frame. The higher this computed error, the higher the number of errors corrected by the turbo decoder and, thus, the worse the initial side information:

e_n^b(0) = \sum_{x=0}^{3} \sum_{y=0}^{3} (Y_n(x,y) - R_n^{b-1}(x,y))^2    (7)

As shown in Eq. (7), Y_n(x,y) and R_n^{b-1}(x,y) represent the same block in the initial SI frame and the reconstructed frame, respectively, for a given block n and band b-1. In this context, e_n^b(0) is nothing more than the sum of the squared errors for the same block after decoding band b-1, computed for a block size of 4×4, as described above.

3. Block selection: This step has the target to classify the blocks of the current frame as good candidates to be refined or not (since it is impossible to know for sure at this stage). A block is considered a good candidate for refinement if the sum of the squared errors e_n^b(0) computed above exceeds a certain threshold μ. The value adopted for μ, obtained through extensive experiments by Martins et al. [21], is 100.
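The block selection rule of Eq. (7) against the threshold may be sketched as follows (an illustrative NumPy fragment; names are ours):

```python
import numpy as np

THRESHOLD = 100  # threshold value adopted in [21]

def select_blocks_for_refinement(si_frame, rec_frame, block=4, thr=THRESHOLD):
    """Flag 4x4 blocks whose refinement looks promising, per Eq. (7).

    For each block n, e_n^b(0) is the sum of squared errors between the
    initial side information Y_n and the current partially decoded
    reconstruction R_n^{b-1}; blocks whose error exceeds the threshold
    are selected as candidates for refinement.
    """
    h, w = si_frame.shape
    selected = []
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            diff = (si_frame[y0:y0+block, x0:x0+block].astype(np.int64)
                    - rec_frame[y0:y0+block, x0:x0+block].astype(np.int64))
            if int((diff ** 2).sum()) > thr:  # Eq. (7) against threshold
                selected.append((y0, x0))
    return selected
```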

(C) Candidate blocks searching: After identifying the blocks in the side information frame whose refinement seems to be promising, it is necessary to find the SI candidate blocks that can replace them and improve the SI frame quality. This sub-module performs this analysis for each 4×4 block capable of being improved:

1. Candidate blocks identification: For each block with promising refinement, a search is performed in the initial side information frame using a certain window size. As there is no need to search the entire frame for a candidate block, a window of ((2×w)+1) × ((2×w)+1) is considered, with w = 4. This value for w was obtained through extensive experiments, as performed to determine the threshold above, and it represents a good trade-off between the RD performance and the complexity [21]. This means that there are 80 possible new SI blocks for each block selected for refinement (excluding the block chosen for refinement itself).

2. Matching error computation: From these 80 possible candidate SI blocks, it is necessary to perform some filtering, since not all blocks are suitable enough to be considered good candidate blocks. With the purpose of filtering out the undesired (or not sufficiently better) SI candidates, an error metric, computed as the sum of the squared errors between the block being refined in the reconstructed frame R_n^{b-1} and the candidate block k in the initial side information frame Y_n^{d(k)}, is computed as follows:

e_n^b(k) = \sum_{x=0}^{3} \sum_{y=0}^{3} (Y_n^{d(k)}(x,y) - R_n^{b-1}(x,y))^2    (8)

d(k) = (dx(k), dy(k)); \quad dx(k), dy(k) \in [-w, w]    (9)

As shown in Eq. (9), d(k) is the displacement associated to the candidate block k with size 4×4, and (x,y) corresponds to the pixel position inside that same block. Note that the displacement of block k is limited by a window size of 9×9 (the 81 candidates minus the block selected for refinement itself), as w = 4 in Eq. (9).

3. Candidate blocks filtering: Depending on the error computed above, a SI candidate block is kept or not, based on Eq. (10). To dismiss all the SI candidate blocks proving to be just equal or not much better than the SI block under refinement, the sum of squared errors for a given candidate block k has to be inferior to the sum of squared errors for the initial SI block with a penalty P. The value for this penalty is 0.2, as this represents a good trade-off between the RD performance and the additional decoder complexity:

e_n^b(k) < e_n^b(0)(1 - P)    (10)

From the 80 candidate blocks, only those fulfilling Eq. (10) will be considered eligible for the refinement process. Thus, given a certain e_n^b(k) for a given block k, a weight β_n^b(k), as defined in Eq. (11), must be defined so that this candidate impacts more or less the refined SI for that block. If the sum of squared errors for a given candidate block has a low value, demonstrating that it is a good estimation, then the computed weight is increased, providing a higher confidence in that candidate, and vice-versa:

\beta_n^b(k) =
\begin{cases}
1/e_n^b(k), & \text{if } e_n^b(k) < e_n^b(0)(1-P) \\
0, & \text{otherwise}
\end{cases}    (11)

(D) New side information creation: The final step of the refinement process corresponds to the generation of the new, improved SI frame with Eq. (12), using all the approved candidate blocks. Instead of just relying on the best candidate, this means the one with the lowest sum of squared errors, a statistical approach is adopted using the weights β_n^b(k) associated to the selected candidate blocks k determined in the previous module. Thus, by using a normalized and weighted mean, a new, refined SI for block n is obtained:

Y_n^{SIR}(x,y) = \frac{\sum_k Y_n^{d(k)}(x,y)\,\beta_n^b(k)}{\sum_k \beta_n^b(k)}    (12)

Note that this whole refinement process is performed after band b-1 is decoded; hence, this new refined side information frame helps decoding the next band b and obtaining the next DCT coefficient estimations.
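The candidate searching, filtering and weighted merging of Eqs. (8)-(12) might be sketched as follows; an illustrative NumPy implementation under the parameter values given above (the early return on a perfect match is our simplification, since the 1/e_n^b(k) weight of Eq. (11) would be unbounded there):

```python
import numpy as np

W = 4    # search window half-size w, as in [21]
P = 0.2  # penalty of the filtering rule in Eq. (10)

def refine_block(si_frame, rec_frame, y0, x0, block=4):
    """Refine one SI block with the weighted candidate mean of Eq. (12).

    Candidates are the blocks displaced by d(k) in [-W, W]^2 around the
    block under refinement (excluding the null displacement). A candidate
    is kept only if its matching error e_n^b(k), Eq. (8), beats the
    current error e_n^b(0) with the penalty P, Eq. (10); kept candidates
    get weight beta = 1/e_n^b(k), Eq. (11), and the refined block is
    their normalized weighted mean, Eq. (12).
    """
    ref = rec_frame[y0:y0+block, x0:x0+block].astype(np.float64)
    cur = si_frame[y0:y0+block, x0:x0+block].astype(np.float64)
    e0 = ((cur - ref) ** 2).sum()            # e_n^b(0), Eq. (7)
    num = np.zeros((block, block))
    den = 0.0
    h, w = si_frame.shape
    for dy in range(-W, W + 1):
        for dx in range(-W, W + 1):
            if dy == 0 and dx == 0:
                continue                     # skip the block under refinement
            ty, tx = y0 + dy, x0 + dx
            if not (0 <= ty <= h - block and 0 <= tx <= w - block):
                continue
            cand = si_frame[ty:ty+block, tx:tx+block].astype(np.float64)
            ek = ((cand - ref) ** 2).sum()   # e_n^b(k), Eq. (8)
            if ek == 0.0:
                return cand                  # perfect match (simplification)
            if ek < e0 * (1.0 - P):          # filtering rule, Eq. (10)
                num += cand / ek             # weight beta = 1/ek, Eq. (11)
                den += 1.0 / ek
    if den == 0.0:
        return cur                           # no candidate survived: keep SI
    return num / den                         # weighted mean, Eq. (12)
```

When no candidate beats the penalized current error, the initial SI block is kept unchanged, so refinement can never select a clearly worse estimate.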

4.2. Reconstruction

This section describes the reconstruction process, essential for the refinement process described above. There are two ways to reconstruct the DCT bands/coefficients: the first involves the reconstruction of the quantized DCT bands, while the second regards the reconstruction of the unquantized DCT bands, for which no WZ bits were sent from the encoder to the decoder. These unquantized DCT bands result from the use of different RD points when encoding the sequence, i.e. depending on the matrices used in the quantization process, more or fewer DCT bands are sent to the decoder. The main difference between the reconstruction process in the ALD-DVC SIR codec and the one used in the ALD-DVC codec is simply the number of times the reconstruction process is performed. While the ALD-DVC codec only performs the reconstruction process once for each frame, after decoding all (coded) DCT coefficients, the ALD-DVC SIR codec performs the reconstruction 16 times (corresponding to the 4×4 bands) for each frame.

(A) Quantized DCT bands: After successfully decoding a given DCT band in the turbo decoder, it is possible to initiate the reconstruction process. As the decoding process only gives the DCT coefficient bin q' where the original DCT coefficient should lie, it is necessary to determine a precise value for the reconstructed DCT coefficient. The adopted reconstruction function is optimal in the sense that it minimizes the MSE of the reconstructed value for each DCT coefficient, as designed in [22].

(B) Unquantized DCT bands: The ALD-DVC codec simply copies from the SI DCT coefficients the bands for which no WZ bits were sent. This may imply some problems, mainly because many errors may be left uncorrected in these specific bands (not corrected at all). The ALD-DVC SIR codec applies exactly the same process but adds on top of it the novel refinement process, which boosts the RD performance as rate gains are achieved by refining the SI frame.

5. Performance evaluation

This section evaluates the performance of the ALD-DVC and ALD-DVC SIR codecs in comparison with relevant benchmarks, both DVC and standard-based codecs. With this purpose, the following test conditions were used:

• Test sequences: Foreman, Hall Monitor and Coastguard (full) sequences, since they represent types of video content associated to applications for which DVC may be promising.

• Temporal and spatial resolution: Sequences were coded at 15 Hz with QCIF resolution (176×144 luminance samples).

• GOP sizes: The GOP size may be 2, 4 or even 8; in case of omission, a GOP size of 2 is assumed.

• Rate distortion points: To define several RD trade-off points, eight quantization matrices were adopted as in [5,11]; the eight 4×4 quantization matrices define the number of quantization levels associated to the various DCT coefficient bands (see Fig. 7). The use of quantization matrices from (a) to (h) in Fig. 7 corresponds to a quality improvement but also to a bitrate increase. The key frames are encoded in such a way (this means using a quantization step) as to have an average quality similar to the WZ frames, so that the overall video quality does not have significant temporal variations. It is important to notice that this claim regards the interpolation-based DISCOVER DVC codec [11].

• Bitrate and PSNR: As usual in the DVC literature, only the luminance of the original sequences is considered. Both the WZ and key frames bitrate and PSNR are

Fig. 7. Eight quantization matrices corresponding to the tested RD points.


accounted for, as opposed to the low delay DVC codec performances typically reported in the literature. To allow a good knowledge of the ALD-DVC RD performance, the codec performance is compared with some state-of-the-art video coding solutions, both DVC and standard-based codecs; for the comparisons to be minimally fair, the selected standard-based codecs do not use motion estimation at the encoder, to guarantee that all have a rather similar, low encoding complexity. In this context, the benchmark video codecs selected are:

• H.263+ Intra: An important, although no longer state-of-the-art, benchmark, as it is used in many DVC papers to check RD performance results; no temporal redundancy is exploited in this codec.

• H.264/AVC Intra: The state-of-the-art in standard Intra coding, in this case using the Main profile; again, it does not exploit temporal redundancy [23].

• H.264/AVC zero motion: As opposed to the H.264/AVC Intra benchmark, the temporal redundancy is exploited here, although without using motion estimation, to limit the encoder complexity [23].

• DISCOVER DVC: Considered the state-of-the-art in DVC codecs [11], this codec is a good DVC benchmark although it uses an interpolation-based approach to generate the side information; as such, its performance may only be taken as a limit to be reached by extrapolation-based (low delay) DVC codecs.

• IST DVC interpolation: Corresponds to an interpolation-based DVC codec using the side information creation solution described in [5].

It is important to explain here that no low delay DVC codec is used as a benchmark, for the simple reason that there is no fully realistic codec available in the literature since, as described in Section 2, all low delay DVC codecs in the literature adopt some unrealistic assumption, such as using original key frames at the decoder for the side information creation or for the request stopping criterion.

(A) RD performance evaluation for GOP size 2: The RD performance results for GOP size 2 are presented in Fig. 8.

• ALD-DVC SIR versus ALD-DVC: Although the comparison with other, more common benchmarking solutions presented below is important, this comparison between the two proposed DVC solutions is very relevant as it shows the gains obtained by evolving from the ALD-DVC codec to the ALD-DVC SIR codec, this means the RD performance benefits of using a SI refinement approach. As mentioned before, basically only rate gains are expected since the same quantization matrices are used and, thus, basically the same quality is obtained; in this context, the only difference regards the size of the rate reduction obtained by using the additional SIR module, corresponding to a RD performance gain for a certain quality. Bearing that in mind, it can be concluded that the Hall Monitor and Coastguard sequences present the lowest RD performance gains; this is expectable as these are the most stable sequences, presenting either rather low complexity motion or well behaved motion and, thus, the initial side information is already rather good and little refinement can be made. Still, the Hall Monitor sequence shows gains up to 0.4 dB, proving that the ALD-DVC SIR codec performs better than the ALD-DVC codec. The ALD-DVC SIR codec definitely shows better RD performance results for the more complex motion sequences, notably Foreman, with major improvements up to almost 3 dB. This was expected since, when the motion is more complex, the side information extrapolation process is less reliable and, thus, the initial side information may be largely improved with the novel information obtained with the successive decoding of the DCT bands.

• ALD-DVC SIR versus DISCOVER DVC: The DISCOVER DVC codec performs better for the Hall Monitor

Fig. 8. ALD-DVC and ALD-DVC SIR RD performance comparison for all test sequences, QCIF, 15 Hz, GOP size 2.


and Coastguard sequences, with its RD performance curve always above the ALD-DVC SIR codec RD performance curve; this behavior is inverted for the Foreman sequence. In fact, for the Foreman sequence, the ALD-DVC SIR codec performs rather well, increasing the gap between these two codecs as the rate grows; for the higher RD points, corresponding to the higher qualities, the gains go up to 1.5 dB. This fact implies that the gains obtained with the usage of the SIR tool in an extrapolation-based codec are higher than the gains obtained with the usage of an interpolation-based SI creation solution, which has algorithmic delay as an additional cost.

• ALD-DVC SIR versus standard Intra codecs: When comparing the ALD-DVC SIR codec against the H.263+ Intra standard, it is clear that its performance is better for every sequence and RD point. Larger gains occur for the Foreman and Hall Monitor sequences, with gains up to 4 dB and 6.5 dB, respectively. When considering the state-of-the-art H.264/AVC Intra codec, the scenario is not as good but still


significant improvements are obtained with the ALD-DVC SIR codec, with the exception of the Coastguard sequence, where the RD performance is similar. For the Foreman sequence, the best case shows a gain of 1 dB, favoring the codec proposed in this paper. The gap widens even more for the Hall Monitor sequence, with the ALD-DVC SIR codec achieving a gain up to about 2.5 dB. In summary, the SIR tool allows reducing the losses or increasing the gains of the ALD-DVC codec regarding the H.264/AVC Intra codec.

• ALD-DVC SIR versus H.264/AVC zero motion: As mentioned before, the H.264/AVC zero motion codec exploits part of the temporal redundancy, increasing the RD performance, notably for sequences with low complexity motion, such as the Hall Monitor sequence, since there are large static areas. Thus, for that particular sequence, the ALD-DVC SIR codec still registers losses up to 2 dB, as the H.264/AVC zero motion codec already greatly exploits the redundancy between frames. However, for the Foreman sequence the proposed ALD-DVC SIR codec already presents better results, with gains up to about 0.5 dB.

Fig. 9. ALD-DVC and ALD-DVC SIR RD performance comparison for GOP sizes 2, 4 and 8, QCIF, 15 Hz.

(B) RD performance evaluation for longer GOP sizes: This section intends to study the variation of the RD performance with the GOP size; as such, only the ALD-DVC and ALD-DVC SIR RD performances will be analyzed, using GOP sizes 2, 4 and 8, as shown in Fig. 9. The longer the GOP size, the lower the overall encoder complexity, since the WZ frames encoding process is typically less complex than the key frames encoding process [5].

• ALD-DVC and ALD-DVC SIR codecs performance for different GOP sizes: As expected, as the GOP size increases, there is a drop in the RD performance for the Foreman and Coastguard sequences, as opposed to the Hall Monitor sequence, which is the most stable. The reduction of Intra coded key frames, with increased quality over the WZ frames, takes a toll on the performance of both ALD-DVC codecs. With lower quality decoded frames used in the extrapolation process, there is a growing drop in the reliability of the motion estimation process and in the side information quality as the GOP size increases. As there is more motion in the Foreman and Coastguard sequences, this expected behavior is observed in Fig. 9. For the Hall Monitor sequence, which is characterized by very low motion, the RD performance of both ALD-DVC codecs grows with the GOP size. The refinement process, allied to the good motion estimation performed by the Side Information Creation module, is very efficient for this particular sequence. For the Foreman sequence, there is almost no drop in the ALD-DVC SIR RD performance curves as the GOP size increases from 2 to 8, which seems to indicate that the SIR tool compensates for the lower quality of the initial side information, which is not the case for the ALD-DVC codec.

• ALD-DVC SIR versus ALD-DVC using the same GOP sizes: The RD performance comparison between the ALD-DVC and ALD-DVC SIR codecs using the same GOP size seems to indicate that the ALD-DVC SIR codec gains typically increase with the GOP size. The largest gains are obtained for the Foreman sequence, where the gains for GOP size 4 go up to 4 dB, and for GOP size 8 up to 5 dB.

(C) Complexity and delay considerations: Although a new module has been included in the ALD-DVC SIR decoder, the overall decoding complexity is typically reduced, as the improved side information reduces the number of times the turbo decoder has to use the feedback channel to ask for more parity bits from the encoder, since fewer errors have to be corrected. This allows reducing the decoding complexity, as fewer runs of the turbo decoder have to be performed.

Moreover, the algorithmic delay is reduced regarding interpolation-based DVC solutions, such as the DISCOVER DVC codec, since no future-based side information processing is allowed. This algorithmic delay reduction is at least one frame period for GOP size 2, and larger for longer GOP sizes. While an interpolation-based side information creation solution implies algorithmic delay (like, typically, B frames in predictive coding) which is impossible to reduce, the proposed extrapolation-based side information creation solution does not imply any algorithmic delay and, thus, the delay may be made as small as needed by software optimization or by adding computational power at the decoder.

6. Final remarks

This paper proposes a novel efficient, low delay DVC architecture including novel extrapolation-based side information generation and adaptive correlation noise modeling tools integrated in a side information iterative refinement framework. As far as the authors know, this architectural design has never been proposed in the literature. Moreover, this low delay DVC codec is the first one in the literature designed and evaluated without adopting any unrealistic assumptions, such as using original key frames at the decoder for the side information creation or for the request stopping criterion.

The RD performance assessment, made under relevant test conditions and in comparison with appropriate benchmarks, demonstrated that the ALD-DVC SIR codec is not only low delay driven, since it uses extrapolation-based side information techniques, but also rather efficient, as it performs better than or at least similarly to some relevant standard-based schemes, such as H.263+ Intra and even the very efficient H.264/AVC Intra codec. It was also shown that the ALD-DVC SIR codec performs very well when compared to state-of-the-art DVC schemes, such as the DISCOVER DVC codec, further closing the gap between extrapolation- and interpolation-based DVC codecs.

Naturally, there are always new ideas to enhance the RD performance of DVC codecs, allowing for even larger gains. Motion estimation is always a problem when considering extrapolation techniques; hence, further research in this area should yield better results and further close the gap between interpolation- and extrapolation-based DVC codecs. This may include the usage of more sophisticated motion models and of more than two past decoded frames in the extrapolation process. Also, using a more efficient key frame coding solution could boost the RD performance, although this would raise the overall encoding complexity. More accurate correlation noise models are also very important to increase the compression efficiency, not only of the ALD-DVC SIR codec but also of other DVC solutions; a way to raise this accuracy may rely on a better estimation of the residual information.
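As an illustration of the motion-compensated extrapolation principle underlying this discussion, the following is a minimal 1-D toy sketch (a global-shift model with a SAD criterion, not the paper's actual Side Information Creation module): motion found between the two past decoded frames is linearly projected one step forward to predict the current frame.

```python
# Toy 1-D motion-compensated extrapolation from two past decoded frames.
# Illustrative only: real codecs use block-based 2-D motion fields.

def best_shift(prev2, prev1, max_shift=3):
    """Global shift that best maps prev2 onto prev1 (minimum SAD)."""
    n = len(prev1)
    best, best_sad = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        sad = sum(abs(prev1[i] - prev2[i - s])
                  for i in range(n) if 0 <= i - s < n)
        if sad < best_sad:
            best, best_sad = s, sad
    return best

def extrapolate(prev2, prev1, max_shift=3):
    """Predict frame t by projecting the prev2 -> prev1 motion one step."""
    s = best_shift(prev2, prev1, max_shift)
    n = len(prev1)
    # Samples whose projected position falls outside the frame are copied
    # from the most recent decoded frame (simple hole filling).
    return [prev1[i - s] if 0 <= i - s < n else prev1[i] for i in range(n)]

# An object moving one sample to the right per frame:
f0 = [0, 9, 0, 0, 0, 0]
f1 = [0, 0, 9, 0, 0, 0]
print(extrapolate(f0, f1))  # prints [0, 0, 0, 9, 0, 0]
```

Because only past frames are used, this construction carries no structural delay; its weakness, as noted above, is that the projected motion may not match the true motion, which is exactly what side information refinement tries to mitigate.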

References

[1] D. Slepian, J. Wolf, Noiseless coding of correlated information sources, IEEE Transactions on Information Theory 19 (4) (1973) 471–480.


[2] A. Wyner, J. Ziv, The rate-distortion function for source coding with side information at the decoder, IEEE Transactions on Information Theory 22 (1) (1976) 1–10.

[3] F. Pereira, et al., Distributed video coding: selecting the most promising application scenarios, Signal Processing: Image Communication 23 (5) (2008) 339–352.

[4] B. Girod, A. Aaron, S. Rane, D. Rebollo-Monedero, Distributed video coding, Proceedings of the IEEE 93 (1) (2005) 71–83.

[5] F. Pereira, C. Brites, J. Ascenso, Distributed video coding: basics, codecs and performance, in: M. Gastpar, P.L. Dragotti (Eds.), Distributed Source Coding: Theory, Algorithms and Applications, Academic Press, 2009.

[6] C. Guillemot, F. Pereira, L. Torres, T. Ebrahimi, R. Leonardi, J. Ostermann, Distributed monoview and multiview video coding, IEEE Signal Processing Magazine 24 (5) (2007) 67–76.

[7] A. Aaron, R. Zhang, B. Girod, Wyner–Ziv coding of motion video, in: Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, November 2002.

[8] A. Aaron, S. Rane, E. Setton, B. Girod, Transform-domain Wyner–Ziv codec for video, in: Visual Communications and Image Processing Conference, San Jose, CA, USA, January 2004.

[9] R. Puri, K. Ramchandran, PRISM: a new robust video coding architecture based on distributed compression principles, in: 40th Allerton Conference on Communication, Control and Computing, Allerton, IL, USA, October 2002.

[10] R. Puri, A. Majumdar, K. Ramchandran, PRISM: a video coding paradigm with motion estimation at the decoder, IEEE Transactions on Image Processing 16 (10) (2007) 2436–2448.

[11] DISCOVER Project Page, ⟨http://www.img.lx.it.pt/~discover/home.html⟩.

[12] L. Natario, C. Brites, J. Ascenso, F. Pereira, Extrapolating side information for low-delay pixel-domain distributed video coding, in: International Workshop on Very Low Bitrate Video Coding, Sardinia, Italy, September 2005.

[13] W.A.R.J. Weerakkody, W.A.C. Fernando, J.L. Martínez, P. Cuenca, F. Quiles, An iterative refinement technique for side information generation in DVC, in: IEEE International Conference on Multimedia and Expo, Beijing, China, July 2007.

[14] A.B.B. Adikari, W.A.C. Fernando, W.A.R.J. Weerakkody, Multiple side information streams for distributed video coding, IET Electronics Letters 42 (25) (2006) 1447–1449.

[15] A.B.B. Adikari, W.A.C. Fernando, W.A.R.J. Weerakkody, H.K. Arachchi, Sequential motion estimation using luminance and chrominance information for distributed video coding of Wyner–Ziv frames, IEE Electronics Letters 42 (7) (2006) 398–399.

[16] S. Borchert, R.P. Westerlaken, R. Klein Gunnewiek, R.L. Lagendijk, Improving motion compensated extrapolation for distributed video coding, in: Thirteenth Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, June 2007.

[17] S. Borchert, R.P. Westerlaken, R. Klein Gunnewiek, R.L. Lagendijk, On extrapolating side information in distributed video coding, in: 26th Picture Coding Symposium, Lisbon, Portugal, November 2007.

[18] G. de Haan, P.W.A.C. Biezen, H. Huijgen, O. Ojo, True-motion estimation using 3-D recursive-search block matching, IEEE Transactions on Circuits and Systems for Video Technology 3 (5) (1993).

[19] R. Braspenning, G. de Haan, Efficient motion estimation with content-adaptive resolution, in: Proceedings of the International Symposium on Consumer Electronics, September 2002, pp. 29–34.

[20] C. Brites, F. Pereira, Correlation noise modeling for efficient pixel and transform domain Wyner–Ziv video coding, IEEE Transactions on Circuits and Systems for Video Technology 18 (9) (2008) 1177–1190.

[21] R. Martins, J. Ascenso, C. Brites, F. Pereira, Refining side information for improved transform domain Wyner–Ziv video coding, IEEE Transactions on Circuits and Systems for Video Technology 19 (9) (2009) 1327–1341.

[22] D. Kubasov, J. Nayak, C. Guillemot, Optimal reconstruction in Wyner–Ziv video coding with multiple side information, in: IEEE Multimedia Signal Processing Workshop, Chania, Crete, Greece, October 2007.

[23] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (7) (2003) 560–576.