Drift compensation for reduced spatial resolution transcoding

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 11, NOVEMBER 2002 1009

Drift Compensation for Reduced SpatialResolution Transcoding

Peng Yin, Member, IEEE, Anthony Vetro, Member, IEEE, Bede Liu, Fellow, IEEE, and Huifang Sun, Fellow, IEEE

Abstract—This paper discusses the problem of reduced-resolu-tion transcoding of compressed video bitstreams. An analysis ofdrift errors is provided to identify the sources of quality degra-dation when transcoding to a lower spatial resolution. Two typesof drift error are considered: a reference picture error, which hasbeen identified in previous works, and error due to the noncom-mutative property of motion compensation and down-sampling,which is unique to this work. To overcome these sources of error,four novel architectures are presented. One architecture attemptsto compensate for the reference picture error in the reduced res-olution, while another architecture attempts to do the same in theoriginal resolution. We present a third architecture that attemptsto eliminate the second type of drift error and a final architecturethat relies on an intrablock refresh method to compensate for alltypes of errors. In all of these architectures, a variety of macroblocklevel conversions are required, such as motion vector mapping andtexture down-sampling. These conversions are discussed in detail.Another important issue for the transcoder is rate control. Thisis especially important for the intra-refresh architecture since itmust find a balance between number of intrablocks used to com-pensate for errors and the associated rate-distortion characteris-tics of the low-resolution signal. The complexity and quality of thearchitectures are compared. Based on the results, we find that theintra-refresh architecture offers the best tradeoff between qualityand complexity and is also the most flexible.

Index Terms—Drift compensation, MPEG-2 to MPEG-4,reduced resolution, transcoding.

I. INTRODUCTION

I N THE context of coding and transmission, there is an in-creasing need to perform many types of conversions to ac-

commodate terminal constraints, network limitations, or pref-erences of a user. On one side there exists a rich set of contentthat is rapidly growing by the day, and on the other, we haveterminals with varying capabilities. Furthermore, these two en-tities may be connected through a variety of different networkconfigurations. It should be clear that the transmission and play-back of content in such a diverse and dynamic environment mayrequire many types of different conversions.

In this paper, we focus our attention to the problem of reducedresolution transcoding. Specifically, we consider techniques andarchitectures that convert a compressed video bitstream that has

Manuscript received October 16, 2001; revised August 1, 2002. This workwas supported in part by a New Jersey State R&D Excellence Grant. This paperwas recommended by Associate Editor A. J. Luthra.

P. Yin was with the Department of Electrical Engineering, Princeton Univer-sity, Princeton, NJ 08544 USA. She is now with Corporate Research, ThomsonMultimedia Inc., Princeton, NJ 08540 USA (e-mail: [email protected]).

B. Liu is with the Department of Electrical Engineering, Princeton University,Princeton, NJ 08544 USA (e-mail: [email protected]).

A. Vetro and H. Sun are with the Mitsubishi Electric Research Laboratories,Murray Hill, NJ 07974 USA (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSVT.2002.805509

been encoded at one spatial resolution to an output bitstreamwith half the spatial resolution. The methods presented in thispaper aim to minimize the amount of drift in the reconstructedsequence. The transcoding methods themselves can be appliedwithin the same syntax format or between different syntax for-mats. Since MPEG-4 is adopted as the solution for mobile mul-timedia communications and a large amount of MPEG-1/2 con-tent is available, we often focus our discussion on MPEG-2 toMPEG-4 transcoding. This is particularly important due to thelimited display size of mobile terminals. However, details on thedifferences between syntax formats are not elaborated on.

Actually, transcoding has been an extensively studied topicin recent years. In earlier works, bit-rate reduction techniqueswere explored to meet available channel capacity [1]–[3]. Thesearchitectures were further refined in [4] to reduce overall com-plexity. Additionally, researchers have investigated conversionsbetween constant bit-rate (CBR) streams and variable bit-rate(VBR) streams to facilitate more efficient transport of video [5].To improve the performance, techniques to refine the motionvectors have also been proposed [6].

Several papers have addressed the problem of resolution con-version [7]–[10]. The primary focus of the work in [7] was onmotion vector scaling techniques. In [8], the authors proposeto use DCT-domain downscaling and motion compensation fortranscoding, while also considering motion vector scaling andcoding mode decisions. With the proposed two-loop architec-ture, computational savings of 40% have been reported with aminimal loss in quality. In [9] and [10], a comprehensive studyon the transcoding to lower spatio-temporal resolutions and todifferent encoding formats has been provided based on the reuseof motion parameters. In this work, a full decoding and encodingloop were employed, but, with the reuse of information, pro-cessing time was improved by a factor of 3.

In this paper, we provide a different approach to the problemof spatial resolution transcoding by providing a detailed ana-lytical analysis of the overall architecture. In doing so, we givesignificant attention to the degradation of reconstructed mac-roblocks in reduced resolution pictures. Based on this analysis,several new architectures for compressed-domain transcodingare developed and analyzed. We also address the issue of con-structing a reduced-resolution macroblock from macroblocks inthe original bitstream that have different coding modes. We referto this problem as themixed block problemand present severalway to overcome it.

The organization of the paper is as follows. We shall focuson the development of drift compensation architectures withtranscoding performed in the compressed domain. SinceB-frames do not introduce any drift propagation, our discussion

1051-8215/02$17.00 © 2002 IEEE

1010 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 11, NOVEMBER 2002

is focused on P-frames only. Before presenting the variousarchitectures themselves, an analysis of the drift problem forreduced-resolution transcoding is given in Section II. Here, weintroduce a reference architecture and compare it to a simpleopen-loop transcoding architecture. The open-loop architectureis used to identify the sources of drift error with respect to thereference. In Section III, the various types of macroblock-levelconversions that are needed for reduced resolution transcodingare considered. Based on the results of our analysis, we presentfour transcoding architectures that attempt to overcome the typesof drift errors that have been identified. These architecturesare presented in Section IV. In Section V, rate control issuesare discussed. Finally, experimental results, including qualityand complexity comparisons, are presented and discussed inSection VI, and our concluding remarks are given in Section VII.

II. DRIFT ERRORANALYSIS

In video coding, drift error refers to the continuous decreasein picture quality when a group of motion-compensated (MC)interframe pictures are decoded. In general, it is due to an ac-cumulation of error in the decoder’s MC loop; however, thecause of drift is due to many different factors. The problemof drift error has been studied for full-resolution transcoding[4] and for multilayer coding [11], but no explicit analysis hasbeen done for reduced-spatial-resolution transcoding. In thissection, we will analyze drift errors by comparing a cascadedclosed-loop architecture that is drift-free with an open-loop ar-chitecture. The open-loop architecture is the simplest and lowestcomplexity architecture that one can consider and it is charac-terized by severe drift errors due to many sources. In Section IV,we will consider new drift-compensating architectures based onthe drift error analysis. These new architectures attempt to sim-plify the drift-free architecture, while improving the quality ofthe open-loop architecture.

With regard to notation, lowercase variables indicate spatialdomain signals, while uppercase variables represent the equiv-alent signal in the DCT domain. The subscript on the variableindicates time, while a superscript equaling one denotes an inputsignal and a superscript equaling two denotes an output signal.

refers to down-sampling and to up-sampling. denotesfull-resolution motion compensation and denotes reduced-resolution motion compensation. indicates a full-resolutionimage signal, indicates a reduced-resolution image signal,indicates a full-resolution residual signal, andindicates a re-duced-resolution residual signal. denotes the set of full-res-olution motion vectors and denotes the set of reduced-res-olution motion vectors.

A. Reference Architecture

Fig. 1 shows a cascaded closed-loop transcoding architecture,which is referred to as theReference. This architecture is themost complex and costly, but it has no drift errors and providesa basis for drift error analysis and further simplification.

Since there is no MC prediction for I-frames, . Thereconstructed signal is then down-sampled

(1)

Then, in the encoder, .

Fig. 1. Reference architecture.

In the case of P-frames, the identity

(2)

yields the reconstructed full-resolution picture. As with theI-frame, this signal is then down-sampled via (1). Then, thereduced-resolution residual is generated according to

(3)

which is equivalently expressed as

(4)

The signal given by (4) represents the reference signal that isfree of drift errors. Based on this equation, we can analyze drifterrors.

B. Drift Error Analysis of Open-Loop Architecture

Fig. 2 shows the open-loop architecture for a transcoder,referred asOpenLoop. The objectives and functions of certainblocks, such as the Mixed-Block Processor, Motion Vector(MV) Mapping, and Downs-Sampling will be described laterin Section III. At this time, we are mainly concerned withthe analysis of drift errors caused by reduced-resolutiontranscoding. With thisOpenLooparchitecture, the reduced-res-olution residual is given by

(5)

Compared to (4), the drift errorcan be expressed as

(6)

YIN et al.: DRIFT COMPENSATION FOR REDUCED SPATIAL RESOLUTION TRANSCODING 1011

Fig. 2. Open-loop architecture.

where

(7)

and

(8)

In the above, the drift error has been decomposed into twocategories. The first componentrepresents an error in the ref-erence picture that is used for MC. This error may be caused byrequantization, eliminating some nonzero DCT coefficients, orarithmetic error caused by integer truncation. This is a commondrift error that has been observed in other transcoding works[4]. In this way, the pictures originally used as references by thetranscoder will be different from their counterparts in the de-coder, thus creating a mismatch between predictive and residualcomponents. The second componentis due to the noncom-mutative property of motion compensation and down-sampling,which is unique to reduced-resolution transcoding. There aretwo main factors contributing to the impact of: motion vector(MV) mapping and down-sampling. In mapping MVs from theoriginal-resolution to a reduced-resolution, a truncation of theMV is experienced due to the limited coding precision. In down-sampling to a lower spatial resolution in the compressed do-main, block constraints are often observed to avoid filters thatoverlap between blocks. Due to this effort to reduce complexity,the quality of the down-sampling process must be compromisedand some errors are typically introduced. Regardless of the mag-nitude of these errors for a single frame, the combination ofthese two transformations generally creates a further mismatchbetween the predictive and residual components that will in-crease with every successively predicted picture.

To illustrate this mismatch between predictive and residualcomponents due to the noncommutative property of motioncompensation and down-sampling, we consider an examplewith one-dimensional (1-D) signals and neglect any error dueto requantization (or ). Let denote the reconstructed block,

denote the reference block, anddenote the error (residual)block, all in the original resolution. Furthermore, let denotea full-resolution motion compensation filter and denotea reduced resolution motion compensation filter. Then, thereconstructed block in the original resolution is given by

(9)

If we apply a down-conversion process to both sides, we have

(10)

The quality produced by the above expression would not besubject to the drift errors included in . However, this is not

the signal that is produced by the reduced-resolution transcoder.The actual reconstructed signal is given by

(11)

Since does not usually equal , there isa mismatch between the reduced-resolution predictive andresidual components. To achieve the quality produced by(10), either or both of the predictive and residual compo-nents would need to be modified to match each other. In theReference architecture, this mismatch is eliminated with thesecond (encoder) loop that computes a new reduced-resolutionresidual. With this second loop, the predictive and residualcomponents are realigned. Our objective in Section III is toconsider alternative ways to compensate for drift with reducedcomplexity. Before that, we will first explain several issuesrelated to reduced-resolution transcoding.

III. M ACROBLOCK CONVERSIONS

In this section, we will discuss some issues related to re-duced-resolution transcoding, which correspond to three func-tional units: Mixed Block Processing, MV Mapping, and Tex-ture Down-sampling. These blocks are presented in theOpen-Loop architecture, which is shown in Fig. 2, and will also beused in the new drift-compensating architectures that will bepresented in Section IV.

A. Mixed Block Processing

In transcoding compressed video to a lower spatial resolution,a group of four macroblocks in the original video correspondsto one macroblock in the transcoded video. So, the purpose ofthe mixed-block processor is to preprocess selected groups ofmacroblocks to ensure that the down-sampling process will notgenerate an output macroblock in which its subblocks have dif-ferent coding modes, e.g., both inter and intra subblocks withina single macroblock. Mixed coding modes within a macroblockare not supported by any known video coding standards.

The function of the mixed-block processor includesproducing motion vectors and corresponding residues forinter-frames. Because inter-frames contain both intra- andinter-macroblocks, we need first to decide the macroblock(MB) mode in the output bitstream from the MB modes of thefour corresponding macroblocks in original bitstream. If thefour input MB modes are intra, the output MB mode is intra.Similarly, if the four input MB modes are inter, the output MBmode will also be inter. If neither of these cases is true, weencounter a group ofmixed blocks, i.e, a group of macroblocksthat contain both intra- and inter-coded macroblocks. In thiscase, we need to decide on a consistent mode and modify theassociated motion vectors and DCT coefficients accordingly so


that a nonmixed block is produced in the output. We proposethree methods to accomplish this mixed-block processing.

In the first method,ZeroOut, the MB modes of the mixedmacroblocks are all modified to inter-mode. The MVs for theintra-macroblocks are reset to zero and so are correspondingDCT coefficients. In this way, the input macroblocks that havebeen converted are replicated with data from correspondingblocks in the reference frame.

The second method is calledIntraInter. In this method, theMB modes are also modified to be inter-mode, but the motionvectors for the intra-macroblocks are predicted. The predictionis based on the data in neighboring blocks, which can includeboth texture and motion data. In our experiments, the followingestimate of the new motion vector is used:

(12)

where is the predicted motion vector for the intra-mac-roblock that needs to be converted, is the neighboringinter-macroblocks, and is the corresponding weighting factorbased on the DCT energy of the inter-macroblocks. As an al-ternative, we can simply set the motion vector to be zero, de-pending on which produces less residual. In an encoder, themean absolute difference of the residual blocks are typicallyused for mode decision. The same principles can be appliedhere. Based on the predicted motion vector, a new residual forthe modified macroblock is calculated.

In the third method,InterIntra, the MB modes are all mod-ified to intra-mode. In this case, there is no motion informa-tion associated with the reduced-resolution macroblock, there-fore all associated motion vector data are reset to zero. This isnecessary to perform in the transcoder because the motion vec-tors of neighboring blocks are predicted from the motion of thisblock. To ensure proper reconstruction in the decoder, the mo-tion vector for the particular group of macroblocks must be resetto zero in the transcoder. The intra-DCT coefficients are gener-ated to replace the inter-DCT coefficients.

It should be noted that, to implement the second and thirdmethods, we need a decoding loop to reconstruct a full-resolu-tion picture. The reconstructed data is used as a reference to con-vert the DCT coefficients from intra-to-inter, or inter-to-intra.For a sequence of frames with a small amount of motion anda low level of detail, the low-complexity strategy ofZeroOutcan be used. Otherwise, eitherIntraInter or InterIntra shouldbe used. The performance ofInterIntra is a little better thanIn-traInter, becauseInterIntra can stop drift propagation by trans-forming interblocks to intrablocks.

B. Motion Vector Mapping

When down-sampling four macroblocks to one macroblock,the associated motion vectors have to be mapped. Severalmethods suitable for frame-based motion vector mappinghave been described in past work [7], [12]. To map from fourframe-based motion vectors, i.e., one for each macroblock in agroup, to one motion vector for the newly formed macroblock,a weighted average or median filters can be applied. This isreferred to as a 4:1 mapping.

However, with certain compression standards, such asMPEG-4 and H.263, there is support in the syntax for advancedprediction modes that allow one motion vector per 88 block,i.e., subblock motion. In this case, each motion vector ismapped from a 16 16 macroblock in the original resolutionto an 8 8 block in the reduced resolution macroblock withappropriate scaling by 2. This is referred as a 1:1 mapping.

Usually, 1:1 mapping performs better than 4:1 mapping ifno motion compensation is performed [12]. However, it is in-efficient to always use the 1:1 mapping, because more bits areused to code four motion vectors. We propose a mixed mappingmethod based on the variance of the four input motion vectors. Ifthe variance of these motion vectors is greater than a threshold,then use the 1:1 mapping method presented in [12], otherwiseuse 4:1 mapping.

Because MPEG-2 supports interlaced video, we also need toconsider field-based MV mapping. We first convert the field-based MVs to a frame-based MV by using the top-field MV asthe default [13]. However, if the bottom-field MV is used to pre-dict the top field, the top-field and bottom-field motion vectorsare averaged. Then, we divide the horizontal MV by 2 (the ver-tical MV is already halved by using the top-field MV). It shouldbe pointed out that there is a relation between the methods usedfor texture down-sampling (discussed below) and MV mapping.This is especially true when converting from an interlace sourceto a progressive output. We have tried various combinations ofdown-conversion filters and MV mappings and have verifiedexperimentally that this particular MV mapping method worksslightly better than if the top-field MV is always used.

C. Texture Down-Sampling

For the purpose of down-sampling texture data, this papermainly relies on the filtering concepts that we have developeda few years ago for memory saving decoders. A review ofthis work can be found in [14] and references therein. For thetranscoder, there are some new aspects that must be considered.However, since the derivation of new filters is beyond the scopeof this paper, we only provide a brief explanation of this processand expressions for particular filters used in our experiments.

In our earlier work, the concept of frequency synthesishas been proposed to transform an input DCT macroblockconsisting of four 8 8 DCT blocks into a single 8 8 DCTblock. The computations are actually performed on the rowsand columns of the macroblock using separable 1-D filters. Let

and denote the input vectors of size. Then, the outputdata , also of size , is computed according to

(13)

where

(14)


and

(15)

and for , and 1 for . The abovedown-conversion filters can be applied in both the horizontaland vertical directions and to both frame-DCT and field-DCTblocks. In our experiments, we aim to produce a progressive(MPEG-4) output. Therefore, if the input is a frame-DCT block,we apply the above filters to both fields. On the other hand,if the input is a field-DCT block, a different set of filters areused to simultaneously perform a field-to-frame conversion anddown-conversion in the DCT domain. These filters are given by

(16)

We would also like to point out that all of the above DCT filterscan easily be transformed to spatial-domain filters, which aremathematically equivalent [14]. In fact, such filters are used inour reference architecture to obtain a pure comparison of thearchitecture and avoid any differences due to down-sampling.

IV. DRIFT COMPENSATIONARCHITECTURES

In Section II, the drift errors related to reduced-resolutiontranscoding were analyzed by comparing the signals producedby theReferenceandOpenLooparchitectures. TheOpenLooparchitecture is the simplest, but the drift error can be very se-vere. On the other hand, theReferencearchitecture is completelydrift-free, but the complexity is high. Therefore, it is desirable toapproximate the quality, while achieving significant complexityreduction. In this section, we will present four new architec-tures that provide different means of drift-error compensation.The first two architectures can be viewed as extensions to thefull-resolution transcoding architectures presented in [4]. Thesignal given by (4) represents the reference signal that the ar-chitectures described here approximate. Since I-frames do notcause drift and the processing of I-frames is simple and mature[7], we will focus our discussion on P-frames only.

We note that several methods have been proposed to performmotion compensation in the frequency domain, e.g., [4]. In thisway, some DCT/IDCT computation can be avoided, but they in-volve matrix multiplications that may be computationally com-plex. In the architectures presented in this paper, we adopt spa-tial-domain motion compensation for simplicity. However, cor-responding frequency-domain approaches can easily be applied.

A. Drift Compensation in Reduced Resolution

Fig. 3 shows the first closed-loop architecture that performsdrift compensation in the reduced resolution. This architecturewill be referred to asDriftLow. The reduced-resolution residualsignal in Fig. 3 is expressed as

(17)

Fig. 3. Drift compensation in reduced resolution.

Fig. 4. Drift compensation in original resolution.

which can be generated from (4) with the approximation

(18)

The drift compensation is performed on a mac-roblock-by-macroblock basis using the set of reduced-resolu-tion motion vectors. This architecture attempts to eliminate drifterrors caused by . It should be noted that the mixed-blockprocessor would be implemented as an MC loop with framememory if theIntraInter or InterIntra configurations discussedin Section III-A are chosen. Only theZeroOutconfigurationwould avoid this additional loop.

B. Drift Compensation in Original Resolution

Fig. 4 presents the second closed-loop architecture that per-forms drift compensation in the original resolution. This archi-tecture will be referred to asDriftFull . The reduced resolutionresidual signal in Fig. 4 is expressed as

(19)


(20)

This architecture requires up-sampling in the DCT domain,which is briefly discussed below. In this architecture, the driftcompensation is performed on a macroblock-by-macroblock


basis using the set of original-resolution motion vectors. Thisarchitecture attempts to eliminate drift errors caused byas well as drift errors caused by . As with the DriftLowarchitecture, an additional MC loop with frame memory maybe substituted for the mixed-block processor depending on theparticular configuration that has been chosen, i.e.,ZeroOut,IntraInter, or InterIntra.

The up-sampling operation required by this transcoding ar-chitecture is unique from past up-sampling operations describedin [14] in that the subdivision back to four 88 blocks has notbeen considered. In our previous work, the up-sampling filterswere derived based on a least-squares approach that minimizedthe difference between the original and reconstructed data in thespatial domain. Specifically, the optimal up-sampling filteris given by , where denotesa concatenated down-conversion filter matrix based on the fil-ters given in (14). In this way, an DCT block can beup-sampled to a DCT block according to

(21)

However, with the above, the subblock structure of a mac-roblock is not recovered, i.e., a single 1616 DCT block willbe obtained, which is clearly not suitable for the differenceof four 8 8 DCT blocks that need to be computed afterthe up-sampling operation. Without proof, we claim that tworeconstructed DCT blocks and can be obtainedthrough the following expressions:

(22)

where

(23)

Applying the expression given in (22) to rows and columnsof the down-sampled DCT block allows us to reconstruct four8 8 DCT blocks. These blocks can be then be subtracted fromthe DCT blocks prior to encoding to yield the difference signalused for compensating the drift in the original resolution.

C. Partial_Encode Architecture

Fig. 5 shows the third closed-loop architecture that performsdrift compensation in the reduced resolution. We shall referto this architecture asPartial_Encode. The reduced resolutionresidual signal in Fig. 5 is expressed as

(24)


(25)

Fig. 5. Partial_Encodearchitecture.

Fig. 6. Intra_Refreshin OpenLooparchitecture.

The Partial_Encodearchitecture attempts to eliminate drifterrors caused by . It should be noted that, since full-resolutiondecoding is performed with this architecture, there is no mixedblock problem.

D. Intra_Refresh Architecture

As presented in Section III-A, theInterIntra and IntraInterconfigurations of the mixed-block processor require an addi-tional MC loop to reconstruct a full-resolution picture. In thissubsection, we exploit this structure by considering methods tocontrol or minimize the drift. We will present a new methodusing intra-refresh to stop the drift, which can be applied in bothopen-loop and close-loop architectures.

In the mixed-block processor,InterIntra provides a way toconvert an inter-coded block to intra-coded block. Becauseintra-coded blocks are not subject to drift, this kind of conver-sion stops the drift propagation, for both and . We referto this technique of usingInterIntra for drift compensation asIntra_Refresh. This method has successfully been applied toerror-resilience coding schemes [15], and we find the principleis also very useful for reducing drift in a transcoder. TheIntra_Refresharchitecture is illustrated in Fig. 6. As seen fromthe figure, this architecture only has one motion compensatedloop, which is used to reconstruct a full-resolution picture toconvert the DCT coefficients from inter-to-intra. As we willshow, this architecture is very flexible in that it can adaptitself based on the input signal characteristics and transcoding


condition. Besides providing an effective yet simple means tocombat both types of drift errors and , it is also capableof correcting any errors that may be caused by incorrectMV mapping. These properties make this architecture veryappealing. At this time, however, we focus our discussion onthe ability to correct drift errors only and do not consider anyother sources of error.

In the following, we consider two steps for the intra-refreshprocess. First, the amount of drift in the compressed bitstream isestimated. The drift can be estimated based on data pertaining tothe amount of motion vector truncation, residue energy, motionactivity, or requantization error. Second, the estimated value ofdrift is translated to an intra-refresh rate,, that is used as inputto the mixed-block processor.is the percentage of intra-codedmacroblocks in one frame. One point to keep in mind about theintra-refresh technique is that more bits are usually required forcoding intrablocks. For this reason, the intra-refresh process andthe rate control must be considered jointly. The specific issuesinvolved will be discussed in more detail in Section V.

V. RATE CONTROL

In this section, we first review the rate control framework thatis common to all the drift compensation architectures presentedin this paper. Then, special considerations for the rate controlthat must be accounted for in theIntra_Refresharchitecture arediscussed. These special considerations arise from the fact thatan increase in the number of intra-coded blocks should be con-sidered in determining the quantizer used to code the frame.

A. Common Framework

We begin our discussion with a brief overview of thequadratic texture model that we use for rate control, which wasproposed by Chiang and Zhang in [16]. Let represent thetexture bits spent for a frame, denote the quantization param-eter, ( , ) the first- and second-order model parametersand the encoding complexity. The relation betweenandis given by

(26)

In an encoder, the complexity measureis typically given bythe mean absolute difference of residual blocks. However, forcompressed-domain transcoding architectures, this measuremust be computed from the DCT coefficients. As a result, weadopt a DCT-based complexity measurewhich was presentedin [17] and is given by

(27)

where are the ac coefficients of a block, is a mac-roblock index in the set of inter-coded blocks, is thenumber of blocks in that set, and is a frequency-dependentweighting, e.g., a quantizer matrix.

Given the above model and complexity measures, the ratecontrol works in two main stages: a pre-encoding and post-en-coding stage. In the preencoding stage, the target estimate for

Fig. 7. RD curve of foreman forIntra_Refresharchitecture.

the frame is obtained based on available bit rate and buffer full-ness, and the complexity is also computed. Then, the value of

is determined based on these values and the current modelparameters. After encoding, the postencoding stage is respon-sible for calibrating the model parameters based in the actualbits spent and corresponding. This can be done by linear re-gression using the results of the pastframes.

B. Intra-Refresh Considerations

Theoretically, the effect of intra-refresh can be characterizedby the operational rate-distortion (R-D) function , i.e.,the average distortion is expressed as a function of the av-erage bit rate and intra-refresh rate. To illustrate the be-havior of this function, consider the Foreman sequence at CIFresolution, encoded at 2 Mbps with and .This bitstream is transcoded with a number of different fixedquantizer scales and various values of. The R-D curves foreach are shown in Fig. 7. The intra-refresh scheme that weused to generate these results is similar to that of H.263 [15].In this scheme, each macroblock is assigned a counter that isincreased if the macroblock is encoded in the interframe mode.If the counter reaches a threshold , which denotes theupdate interval, the macroblock is encoded in intramode and thecounter is reset to zero. By assigning a different initial offsets toeach macroblock, the updates of individual macroblocks can bespread over time.

Fig. 7 shows that overall quality is decreased at low bit rateswhen the intra-refresh rate is high. The reason is that too manybits are consumed by the intra-coded blocks without a sufficientincrease in quality. The opposite is observed at higher bit rates.With a larger amount of bits that can be spent per frame, theoverall quality is increased with more intra-coded blocks. Sincethe goal of the intra-refresh techniques is to minimize the effectof drift, we should point out that at lower bit rates,is likely todominate the overall error, while at higher bit rates the impactof is significantly less and is likely to be more dominant.

In theIntra_Refresharchitecture, it is important that the intra-refresh process be adaptive to account for the above character-istics, and also that the outcome of the process, i.e., the number


of intrablocks to be coded in a frame, be accounted for in certainaspects of the rate control. Specifically, the quantizer selectionand model parameter calculation. In the following, we proposean Adaptive Intra-Refresh (AIR) technique which is simple buteffective. In general, the objective of this technique is to dynam-ically determine the blocks to be converted from inter-to-intra.The particular scheme that we describe is adaptive according toavailable bit-rate and block attributes. After describing the tech-nique itself, we explain how the outcome is accounted for by therate control.

As analyzed in Section II, drift errors come from two mis-matches and . Through observation, we find that a largedrift error always correlates to inter-coded blocks with largeresidual energy or motion activity. Consequently, AIR decidesthat a group of (four) macroblocks need to be intra-coded if thesum of residual energy in this group of macroblocks is largerthan a threshold, , or if the sum of motion vector variance inthis group of macroblocks is larger than a threshold,. Initialvalues for the thresholds are determined experimentally througha simple linear relationship with the distortion (MSE). Specifi-cally, the relations are given by and ,where the parametersand are fitted based on several trainingsequences.1 After each frame is encoded, the thresholds aredynamically adjusted according to the difference between thetarget bit rate and actual bit rate. If the difference is positive, itimplies that the target quality is higher than the bits that have ac-tually been spent and we set the thresholds lower. On the otherhand, if the difference is negative, the thresholds are set higher.Since this AIR decision may ignore the inter-coded boundaryblocks of a moving object, we further expand the intra-refreshboundary to the left or right to cover an object boundary. Itshould be noted that this procedure is not optimal, but, as ex-perimental results will show, it works quite well. The main pur-pose of this scheme is to provide some means of adaptive intra-block conversion to illustrate the concepts and strengths of theIntra-Refresh architecture. An optimal scheme that considersrate-distortion tradeoffs would be a topic for further study.

With the intra-refresh procedure, the total number of intra-blocks may be high and must be accounted for in the rate con-trol. For the quantizer selection, a single quantization param-eter is selected for the frame and is applied to both intra-codedand inter-coded blocks. With this scheme, we consider a hybridcomplexity measure that accounts for both inter- and intra-DCTcoefficients. In other words, (27) is extended to include normal-ized intra-DCT coefficients as well. Specifically

(28)

where are the ac coefficients of an intra-coded block,isa macroblock index in the setof intra-coded blocks, is thetotal number of nonskipped blocks in a frame, and and

are a frequency-dependent weights for inter-coded and

1The training sequences were chosen outside of the sequences used for theexperimental results presented in Section VI.

TABLE ICOMPARISON OFTRANSCODING ARCHITECTURES

(�: INTERINTRA IS USED FOR THE

MIXED-BLOCK PROCESSOR)

intra-coded blocks, respectively. This hybrid complexity mea-sure is used to calculate the updated model parameters aftercoding. With these small modifications to the rate control, wehave found that a better fit between the rate-quantizer modeland the actual data is achieved.

VI. EXPERIMENTAL RESULTS

In this section, we present a comparative study of the qualityand complexity for the architectures under consideration.We simulated all the proposed architectures in software bytranscoding MPEG-1 and MPEG-2 bitstreams to quarter-reso-lution MPEG-4 bitstreams. The machine we used to run in thesimulations is an Ultra 60 workstation with processor speed360 MHz, model UltraSPARC-II, 4-MB L2 Cache, and 512MB memory, and the source code used in these experimentshas not been optimized for speed.

Independent of these above platform specifications, Table Ishows the main blocks contributing to the complexity of eachimplementation. The table also provides a relative indicationof the expected amount of drift error and the primary source ofthat error. With regard to the complexity of the architectures thatemploy inter-to-intra or intra-to-inter conversions, it should benoted that certain blocks involved in the conversion process arenot invoked for every macroblock. For instance, the DCT blockin the Intra_Refresharchitecture is only used for blocks thatrequire a conversion, which may be only a fraction of the totalnumber of blocks in a given frame, e.g., 10%. This is especiallyimportant for software implementations that rely mainly onaverageprocessing times. Also, in our simulations, theInterIntraconfiguration for the mixed-block processor has been adoptedfor theDriftLow andDriftFull architectures since this leads to thebest quality. The numbers in Table I take this configuration intoaccount and will be lower if theZeroOutconfiguration is used.

The tests are conducted with various input sequences at dif-ferent resolutions, formats, and encoding pattern (GOP struc-ture). All the sequences are coded at 30 frames/s. In theIntra-Refreshsimulations, we set initial thresholds: and

. During transcoding, both thresholds are adaptivelyadjusted as described in Section V. For comparison, we also sim-ulate anOptimal architecture, which consists of an MPEG-2decoder, followed by spatial-domain subsampling, and a fullMPEG-4 encoder with motion reestimation. This surely has thebest quality, but has at least 37 times more computational costthan theReference[7].


TABLE IIEXPERIMENTAL RESULTS FORAKIYO

We first conduct some experiments using input bitstreamsthat have been encoded with periodic I-frames. The primarypurpose of this experiment is to analyze the quality and com-plexity of all the presented architectures under common con-ditions that are somewhat tolerant to the accumulation of drift,but not completely forgiving. The original sequence is CIF res-olution (352 288) and coded as an MPEG-1 bitstream with

and .In the first experiment, we consider theAkiyo sequence,

which has low motion and a low level of detail. The sourcebit-rate is 512 kbps and the target rates for the transcoder are 32,64, and 96 kbps. The simulation shows that even the simplestopen-loop architectureZeroOutcan achieve reasonably goodquality and higher quality can be reached by other architectureswith higher complexity. Table II shows processing time inseconds, as well as the mean and variance of the frame-basedPSNR values (in decibels) for all the architectures. We cansee from the table thatInterIntra, IntraInter, IntraRefresh,Partial_Encode, and DriftLow have comparable quality withthe Reference. Also, for these architectures, we have recordedspeed-up factors of 3.59, 3.59, 3.4, 1.68, and 1.36, respectively,in comparison to theReference.

For the second experiment, theForemansequence, which hasa medium amount of motion and a medium level of detail, is con-sidered. The original bit rate is 2 Mbps. The target rates for thetranscoder are 128, 384, and 512 kbps. Through simulation, weobserved that artifacts can be found inZeroout, IntraInter, Inter-Intra, andDriftLow. We believe these artifacts are the result ofbecoming more dominant inwith increased motion. However,suchartifactswerenotobservedinIntra_RefreshandPartial_En-code.For therangeofbit rates tested, thesetwoarchitectureswereable to achieve similar quality to theReference, while decreasingthe time by a factor of 1.7 and 1.4, respectively. We note that thespeed-up forIntra_Refreshhas decreased in this simulation com-pared to the simulation withAkiyo. This is due to the increasedneed in Foreman for inter-to-intra conversion, which mainly re-quires an added DCT for each converted block. Summarized re-sults for all the other architectures are provided in Table III.

To truly test the drift compensation capabilities of theIntra_Refreshand Partial_Encodearchitectures, we test twosequences that have been coded with and .

TABLE IIIEXPERIMENTAL RESULTS FORFOREMAN

The first sequence isForemanwith CIF resolution and codedas MPEG-1 bitstream at a bit rate of 2 Mbps. The target ratesfor the transcoder are 1 Mbps, 512 kbps, and 256 kbps. Thesecond sequence isFootball, which has fast motion and a highlevel of detail. The original sequence is a CCIR601 format(720 480, interlace) and coded as an MPEG-2 bitstream at arate of 6 Mbps. The target rates for the transcoder are 3 Mbps,1.5 Mbps, and 750 kbps. Figs. 8 and 9 provide frame-basedpeak SNR (PSNR) plots of the transcoded results using selectarchitectures. From these plots, we can see thatIntra_Refreshcompensates for drift quite well in both sequences, even withsuch a long series of successive predictions. However, theresults are not so consistent forPartial_Encode, i.e., a strongdecay over time is observed forForemanand almost no decayfor Football. There are two explanations for this occurrence.The first reason is that forForeman plays a more criticalrole as increased, while forFootball is always moreinfluential than due to the high motion. The second reasonis due to the large number of intrablocks in P-frames that wereused to encodeFootball, which also relates to the high motioncomplexity in the sequence. This result indicates that driftpropagates less in such sequences.

From the experiments overall, we notice thatDriftFull withInterIntra is even more complex than theReference. Therefore,this architecture is not recommended for any application. Forsimple sequences with low motion, low level of detail, lowbit rate, short group of pictures (GOP), and small frame size,ZeroOutcan be used to get reasonably good quality. If higherquality is required, other architectures, such asInterIntra,IntraInter, IntraRefresh, Partial_Encode, andDriftLow shouldbe considered. For sequences with medium to fast motion,artifacts can be found inZeroout, IntraInter, InterIntra, andDriftLow. Compared to theReference, only Intra_Refreshcanachieve similar quality with less complexity.Partial_Encodecan only be used in sequences with a short GOP, or sequenceswith a long GOP but with a high amount of intra-codedblocks.

Through the simulations, we observe that theIntra_Refresharchitecture can achieve quality comparable to theReference,especially for sequences with larger motion, long GOP, andlarge frame size. As the scene complexity increases, we haveobserved that the speed-up over theReferencedecreases,


(a)

(b)

(c)

Fig. 8. Experimental result for reduce-spatial-resolution transcoding by 2 withvarious output rate forForeman: (a)N = 100,M = 1, 256 kbps, (b)N = 100,M = 1, 512 kbps, and (c)N = 100,M = 1, 1000 kbps.

i.e., by a factor of 3.4 forAkiyo, 1.7 for Foreman, and 1.44for Football. These results demonstrate the flexibility of theIntra_Refresharchitecture and the balance that it provides interms of quality and complexity.

(a)

(b)

(c)

Fig. 9. Experimental result for reduce-spatial-resolution transcoding by twowith various output rate forFootball: (a)N = 100,M = 1, 750 kbps, (b)N =

100,M = 1, 5 Mbps, and (c)N = 100,M = 1, 3 Mbps.

VII. SUMMARY AND FUTURE WORK

This paper presented several architectures and methods for re-duced-resolution transcoding. Through an analysis of the drift


error for this type of transcoding, we have identified two distinctsources of drift: one type due to errors in the reference picture,which have been studied in other works and arise mainly fromrequantization errors, and a second type that is due to the non-commutative property of motion compensation and down-sam-pling. With these sources of error in mind, four drift-compen-sating architectures that attempt to approximate the quality ofthe reference and reduce its complexity have been proposed.The first architecture,DriftLow, attempted to compensate forthe reference picture error in the reduced resolution, while thesecond architecture,DriftFull , attempted to do the same in theoriginal resolution. The third architecture,Partial_Encodetar-geted the second type of drift error only and the final architec-ture,Intra_Refresh, relied on a novel intrablock refresh methodto partially compensate for all types of errors.

From the experiments we have conducted, we believe thatthe Intra_Refresharchitecture offers the best tradeoff betweenquality and complexity. In most cases, the quality provided bythis transcoding method is very similar to that of the reference.It is also a flexible and adaptable architecture that can easily bescaled in terms of its complexity and quality to meet any numberof system needs.

With regard to future work, a better rate control model tocombine the intra-refresh process with the quantizer selectionrequires more investigation. Another interesting topic is to com-bine temporal resolution reduction with spatial resolution reduc-tion for video transcoding. Finally, we think there are some in-teresting studies to be done regarding the relation between MVmapping and texture down-sampling, especially within the con-text of interlace-to-progressive transcoding.

REFERENCES

[1] H. Sun, W. Kwok, and J. Zdepski, “Architectures for MPEG compressedbitstream scaling,”IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp.191–199, Apr. 1996.

[2] Y. Nakajima, H. Hori, and T. Kanoh, “Rate conversion of MPEG codedvideo by re-quantization process,” inProc. IEEE Int. Conf. Image Pro-cessing, vol. 3, Washington, DC, Oct. 1995, pp. 408–411.

[3] N. Yeadon, F. Garcia, D. Hutchison, and D. Shepherd, “Continuousmedia filters for heterogeneous internetworking,”Proc. SPIE Multi-media Computing & Networking, pp. 246–257, 1996.

[4] P. Assuncao and M. Ghanbari, “A frequency-domain video transcoderfor dynamic bit-rate reduction of MPEG-2 bit streams,”IEEE Trans.Circuits Syst. Video Technol., vol. 8, pp. 943–967, Dec. 1998.

[5] M. Yong, Q. F. Zhu, and V. Eyuboglu, “VBR transport of CBR-encodedvideo over ATM networks,” inProc. Sixth Int. Packet Video Workshop,Sept. 1994.

[6] J. Youn, M. T. Sun, and C. W. Lin, “Motion vector refinement for highperformance transcoding,”IEEE Trans. Multimedia, vol. 1, pp. 30–40,Mar. 1999.

[7] B. Shen, I. K. Sethi, and V. Bhaskaran, “Adaptive motion vector re-sampling for compressed video down-scaling,” inProc. IEEE Int. Conf.Image Processing, Santa Barbara, CA, Oct. 1997, pp. 771–774.

[8] W. Zhu, K. H. Yang, and M. J. Beacken, “CIF-to-QCIF video bitstreamdonw-conversion in the DCT domain,”Bell Labs Tech. J., vol. 3, no. 3,July–Sept. 1998.

[9] T. Shanableh and M. Ghanbari, “Heterogeous video transcoding tolower spatio-temporal resolutions and different encoding formats,”IEEE Trans. Multimedia, vol. 2, pp. 101–110, June 2000.

[10] , “Transcoding architectures for DCT-domain heterogeneous videotranscoding,” inProc. IEEE Int. Conf. Image Processing, vol. 1, Thes-saloniki, Greece, Oct. 2001, pp. 433–436.

[11] R. Mokry and D. Anastassiou, “Minimal error drift in frequency scala-bility for motion-compensated DCT coding,”IEEE Trans. Circuits Syst.Video Technol., vol. 4, pp. 392–406, Aug. 1994.

[12] P. Yin, M. Wu, and B. Liu, “Video transcoding by reducing spatial res-olution,” in Proc. IEEE Int. Conf. Image Processing, Vancouver, BC,Canada, Oct. 2000, pp. 972–975.

[13] S. J. Wee, J. G. Apostolopoulos, and N. Feamster, “Field-to-frametranscoding with spatial and temporal downsampling,” inProc. IEEEInt. Conf. Image Processing, Kobe, Japan, Oct. 1999, pp. 271–275.

[14] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift architecturesfor three-layer scalable DTV decodig,”IEEE Trans. Consumer Elec-tron., vol. 44, pp. 527–536, Aug. 1998.

[15] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, “Analysis of videotransmission over lossy channels,”IEEE J. Select Areas Commun., vol.18, pp. 1012–1032, June 2000.

[16] T. Chiang and Y.-Q. Zhang, “A new rate control scheme using quadraticrate-distortion modeling,”IEEE Trans. Circuits Syst. Video Technol.,vol. 7, pp. 12–20, Feb. 1997.

[17] A. Vetro, H. Sun, and Y. Wang, “Object-based transcoding for adaptablevideo content delivery,”IEEE Trans. Circuits Syst. Video Technol., vol.11, pp. 387–401, Mar. 2001.

Peng Yin (S’99–M’02) received the B.E. degree in electrical engineering fromthe University of Science and Technology of China in 1996 and the M.A. andPh.D. degrees in electrical engineering from Princeton University, Princeton,NJ, in 1999 and 2002, respectively.

She is currently a Researcher in Corporate Research, Thomson MultimediaInc., Princeton, NJ. Her research interest includes image/video coding and pro-cessing for multimedia communication.

Anthony Vetro (S’92–M’96) received the B.S., M.S., and Ph.D. degrees inElectrical Engineering from Polytechnic University, Brooklyn, NY.

He joined Mitsubishi Electric Research Laboratories, Murray Hill, NJ, in1996, and is currently a Senior Principal Member of the Technical Staff. Uponjoining Mitsubishi, he worked on algorithms for down-conversion decoding andwas later involved in the development of a single-chip HDTV receiver that em-ployed these algorithms. More recently, his work has focused on the encodingand transport of multimedia content, with emphasis on video transcoding, rate-distortion modeling and optimal bit allocation. He has published more than 50papers in these areas and holds seven U.S. patents. Since 1997, he has been anactive participant in MPEG, contributing to the development of the MPEG-4and MPEG-7 standards. He is now an editor for Part 7 of the MPEG-21 stan-dard, Digital Item Adaptation.

Dr. Vetro has been a member of the Technical Program Committee for theInternational Conference on Consumer Electronics since 1998, serving as Pub-licity Chair in 1999 and 2000, and Tutorials Chair in 2002. He serves on theAdCom of the IEEE Consumer Electronics Society and on the PublicationsCommittee of the IEEE Transactions on Consumer Electronics. He is a memberof the Technical Committee on Visual Signal Processing and Communicationsof the IEEE Circuits and Systems Society and a member of the Editorial Boardfor theJournal of VLSI Signal Processing Systems.

Bede Liu (S’55–M’62–F’72) was born in China. He received the B.S. degreefrom National Taiwan University, Taipei, Taiwan, R.O.C., and the M.S. andPh.D. degrees from the Polytechnic Institute of Brooklyn, Brooklyn, NY, allin electrical engineering.

He is a Professor of Electrical Engineering at Princeton University, Princeton,NJ. Prior to joining Princeton University in 1962, he had been with Bell Labo-ratories, Allen B. DuMont Laboratory, and Western Electric Company. He hasalso been a visiting faculty member at several universities in the U.S. and abroad.

His current research interest lies mostly with multimedia technology, withparticular emphasis on digital watermarking and video processing.

Prof. Liu and his students have twice received the best paper awards on videotechnology (1994 and 1996). His other IEEE awards include the CentennialMedal (1984) and the Millennium Medal (2000), the IEEE Signal ProcessingSociety’s Technical Achievement Award (1985) and Society Award (2000),the IEEE Circuit and Systems Society’s Education Award (1988), the MacVan Valkenburg Award (1997), and the Golden Jubilee Medals (2000). He isa member of the National Academy of Engineering. He was the President ofthe IEEE Circuit and Systems Society (1982) and was the IEEE Division IDirector (1984, 1985). He also served as the 1978 ISCAS Technical ProgramChair and the General Chair of the 1995 ICIP.


Huifang Sun (S’83–M’85–SM’93–F’00) graduated from Harbin Military En-gineering Institute, Harbin, China. He received the Ph.D. degree from the Uni-versity of Ottawa, ON, Canada.

He joined the Electrical Engineering Department of Fairleigh Dickinson Uni-versity as an Assistant Professor in 1986 and was promoted to an Associate Pro-fessor. He moved to Sarnoff Corporation, Princeton, NJ, in 1990 as a Memberof the Technical Staff and was later promoted to a Technology Leader of DigitalVideo Communications. In 1995, he joined Mitsubishi Electric Research Labo-ratories, Murray Hill, NJ, where he now holds the position of of Deputy Director.His research interests include digital video/image compression and digital com-munication. He has published more than 120 journal and conference papers. Heholds 18 U.S. patents and has more pending.

Dr. Sun received the AD-HDTV Team Award in 1992 and Technical Achieve-ment Award for optimization and specification of the Grand Alliance HDTVvideo compression algorithm in 1994 at Sarnoff Labs. He received the bestpaper award of 1992 IEEE Transaction on Consumer Electronics and Best paperaward of 1996 ICCE. He is now an Associate Editor for IEEE TRANSACTION ON

CIRCUITS AND SYSTEMS FORVIDEO TECHNOLOGYand the Chair of Visual Pro-cessing Technical Committee of IEEE Circuits and System Society.

Documents

Drift compensation for reduced spatial resolution transcoding