
Video Compression—From Concepts to the H.264/AVC Standard

GARY J. SULLIVAN, SENIOR MEMBER, IEEE AND THOMAS WIEGAND

Invited Paper

Over the last one and a half decades, digital video compression technologies have become an integral part of the way we create, communicate, and consume visual information. In this paper, techniques for video compression are reviewed, starting from basic concepts. The rate-distortion performance of modern video compression schemes is the result of an interaction between motion representation techniques, intra-picture prediction techniques, waveform coding of differences, and waveform coding of various refreshed regions. The paper starts with an explanation of the basic concepts of video codec design and then explains how these various features have been integrated into international standards, up to and including the most recent such standard, known as H.264/AVC.

Keywords—Advanced Video Coding (AVC), compression, H.264, H.26x, Joint Video Team (JVT), Moving Picture Experts Group (MPEG), MPEG-4, standards, video, video coding, video compression, Video Coding Experts Group (VCEG).

I. INTRODUCTION

Digital video communication can be found today in many application scenarios, such as:

• broadcast, subscription, and pay-per-view services over satellite, cable, and terrestrial transmission channels (e.g., using H.222.0 MPEG-2 systems [1]);

• wire-line and wireless real-time conversational services (e.g., using H.32x [2]–[4] or Session Initiation Protocol (SIP) [5]);

• Internet or local area network (LAN) video streaming (using Real-Time Protocol/Internet Protocol (RTP/IP) [6]);

• storage formats (e.g., digital versatile disk (DVD), digital camcorders, and personal video recorders).

The basic communication problem may be posed as conveying source data with the highest fidelity possible within an available bit rate, or it may be posed as conveying the source data using the lowest bit rate possible while maintaining a specified reproduction fidelity [7]. In either case, a fundamental tradeoff is made between bit rate and fidelity. The ability of a source coding system to make this tradeoff well is called its coding efficiency or rate-distortion performance, and the coding system itself is referred to as a codec (i.e., a system comprising a coder and a decoder).

Manuscript received February 19, 2004; revised July 30, 2004.
G. J. Sullivan is with the Microsoft Corporation, Redmond, WA 98052 USA (e-mail: [email protected]).
T. Wiegand is with the Heinrich Hertz Institute (FhG), Berlin D 10587, Germany (e-mail: [email protected]).
Digital Object Identifier 10.1109/JPROC.2004.839617

Thus, video codecs are primarily characterized in terms of:

• Throughput of the channel: a characteristic influenced by the transmission channel bit rate and the amount of protocol and error-correction coding overhead incurred by the transmission system.

• Distortion of the decoded video: distortion is primarily induced by the video codec and by channel errors introduced in the path to the video decoder.

However, in practical video transmission systems the following additional issues must be considered as well.

• Delay (startup latency and end-to-end delay): Delay characteristics are influenced by many parameters, including processing delay, buffering, structural delays of video and channel codecs, and the speed at which data are conveyed through the transmission channel.

• Complexity (in terms of computation, memory capacity, and memory access requirements) of the video codec, protocol stacks, and network.

Hence, the practical source coding design problem is posed as follows: given a maximum allowed delay and a maximum allowed complexity, achieve an optimal tradeoff between bit rate and distortion for the range of network environments envisioned in the scope of the applications.

The various application scenarios of video communication show very different optimum working points, and these working points have shifted over time as the constraints on complexity have been eased by Moore's law and as higher bit-rate channels have become available. In this paper, we examine the video codec design problem and the evolution of its solutions up to the latest international standard known as H.264 or MPEG-4 Advanced Video Coding (H.264/AVC).


II. VIDEO SOURCE CODING BASICS

A digital image or a frame of digital video typically consists of three rectangular arrays of integer-valued samples, one array for each of the three components of a tristimulus color representation for the spatial area represented in the image. Video coding often uses a color representation having three components called Y, Cb, and Cr. Component Y is called luma and represents brightness. The two chroma components Cb and Cr represent the extent to which the color deviates from gray toward blue and red, respectively.1 Because the human visual system is more sensitive to luma than chroma, often a sampling structure is used in which the chroma component arrays each have only one-fourth as many samples as the corresponding luma component array (half the number of samples in both the horizontal and vertical dimensions). This is called 4:2:0 sampling. The amplitude of each component is typically represented with 8 b of precision per sample for consumer-quality video.
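
As a concrete illustration of 4:2:0 sampling, the following minimal sketch (in Python; all names are illustrative, not from any standard) computes the luma and chroma array dimensions for a given frame size:

    # For 4:2:0 sampling, each chroma array has half the luma resolution
    # in both dimensions, i.e., one-fourth as many samples overall.
    def plane_sizes_420(width, height):
        luma = (width, height)              # full-resolution Y array
        chroma = (width // 2, height // 2)  # Cb and Cr arrays, subsampled 2:1
        return luma, chroma

    # e.g., a 352x288 frame has a 352x288 Y array and 176x144 Cb and Cr arrays.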

The two basic video formats are progressive and interlaced. A frame array of video samples can be considered to contain two interleaved fields, a top field and a bottom field. The top field contains the even-numbered rows 0, 2, ..., H − 2 (with 0 being the top row number for a frame and H being its total number of rows), and the bottom field contains the odd-numbered rows 1, 3, ..., H − 1 (starting with the second row of the frame). When interlacing is used, rather than capturing the entire frame at each sampling time, only one of the two fields is captured. Thus, two sampling periods are required to capture each full frame of video. We will use the term picture to refer to either a frame or field. If the two fields of a frame are captured at different time instants, the frame is referred to as an interlaced frame, and otherwise it is referred to as a progressive frame.
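
The field interleaving just described can be sketched as follows, assuming (for illustration only) that a frame is stored as a list of rows with the top row at index 0:

    def split_fields(frame_rows):
        top_field = frame_rows[0::2]     # even-numbered rows 0, 2, ..., H - 2
        bottom_field = frame_rows[1::2]  # odd-numbered rows 1, 3, ..., H - 1
        return top_field, bottom_field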

Techniques for digital compression for many applications can typically be classified as follows.

• Prediction: A process by which a set of prediction values is created (often based in part on an indication sent by an encoder of how to form the prediction based on analysis of the input samples and the types of prediction that can be selected in the system design) that is used to predict the values of the input samples, so that the values that need to be represented become only the (typically easier to encode) differences from the predicted values, such differences being called the residual values.

• Transformation: A process (also referred to as subband decomposition) that is closely related to prediction, consisting of forming a new set of samples from a combination of input samples, often using a linear combination. Simplistically speaking, a transformation can prevent the need to repeatedly represent similar values and can capture the essence of the input signal by using frequency analysis. A typical benefit of transformation is a reduction in the statistical correlation of the input samples, so that the most relevant aspects of the set of input samples are typically concentrated into a small number of variables. Two well-known examples of transformation are the Karhunen–Loève transform (KLT), which is an optimal decorrelator, and the discrete cosine transform (DCT), which has performance close to that of a KLT when applied to highly correlated autoregressive sources.

1The terms luma and chroma are used here (and in H.264/AVC) rather than the terms luminance and chrominance, in order to avoid the implication of the use of linear light transfer characteristics that is often associated with the other terms.

• Quantization: A process by which the precision used for the representation of a sample value (or a group of sample values) is reduced in order to reduce the amount of data needed to encode the representation. Such a process is directly analogous to intuitively well-understood concepts such as the rounding off of less significant digits when writing the value of some statistic. Often the rounding precision is controlled by a step size that specifies the smallest representable value increment (see the sketch following this list). Among the techniques listed here for compression, quantization is typically the only one that is inherently noninvertible—that is, quantization involves some form of many-to-few mapping that inherently involves some loss of fidelity. The challenge is to minimize that loss of fidelity in relation to some relevant method of measuring distortion.

• Entropy coding: A process by which discrete-valued source symbols are represented in a manner that takes advantage of the relative probabilities of the various possible values of each source symbol. A well-known type of entropy code is the variable-length code (VLC), which involves establishing a tree-structured code table that uses short binary strings to represent symbol values that are highly likely to occur and longer binary strings to represent less likely symbol values. The best-known method of designing VLCs is the Huffman code method, which produces an optimal VLC. A somewhat less well-known method of entropy coding that can typically compress better than VLC coding and can also be more easily designed to adapt to varying symbol statistics is the newer technique referred to as arithmetic coding.
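
The step-size-controlled rounding described in the Quantization item above can be sketched as follows (a minimal illustration of uniform scalar quantization, not the exact scaling equations of any particular standard):

    def quantize(value, step):
        return round(value / step)   # many-to-few mapping; this is the lossy step

    def dequantize(level, step):
        return level * step          # reconstruct to the nearest multiple of step

    level = quantize(13.7, step=4)     # -> 3
    recon = dequantize(level, step=4)  # -> 12, a fidelity loss of 1.7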

One way of compressing video is simply to compress each picture separately. This is how much of the compression research started in the mid-1960s [9], [10]. Today, the most prevalent syntax for such use is JPEG (1992) [11]. The most common "baseline" JPEG scheme consists of segmenting the picture arrays into equal-size blocks of 8×8 samples each. These blocks are transformed by a DCT [12], and the DCT coefficients are then quantized and transmitted using variable-length codes. We refer to this kind of coding scheme as intra-picture or Intra coding, since the picture is coded without referring to other pictures in a video sequence. In fact, such Intra coding (often called motion JPEG) is in common use for video coding today in production-quality editing systems.

However, improved compression performance can be achieved by taking advantage of the large amount of temporal redundancy in video content. This was recognized at least as long ago as 1929 [13].


Usually, much of the depicted scene is essentially just repeated in picture after picture without any significant change, so video can be represented more efficiently by sending only the changes in the video scene rather than coding all regions repeatedly. We refer to such techniques as inter-picture or Inter coding. This ability to use temporal redundancy to improve coding efficiency is what fundamentally distinguishes video compression from the Intra compression exemplified by JPEG standards. A historical analysis of video coding can be found in [14].

A simple method of improving compression by coding only the changes in a video scene is called conditional replenishment (CR) [15], and it was the only temporal redundancy reduction method used in the first version of the first digital video coding international standard, ITU-T Recommendation H.120 [16]. CR coding consists of sending signals to indicate which areas of a picture can just be repeated, and sending new information to replace the changed areas. Thus, CR allows a choice between one of two modes of representation for each area, which we call Skip and Intra. However, CR has a significant shortcoming, which is its inability to refine the approximation given by a repetition.

Often the content of an area of a prior picture can be a good starting approximation for the corresponding area in a new picture, but this approximation could benefit from some minor alteration to make it a better representation. Adding a third type of "prediction mode," in which a refinement difference approximation can be sent, results in a further improvement of compression performance—leading to the basic design of modern hybrid codecs (using a term coined by Habibi [17] with a somewhat different original meaning). The naming of these codecs refers to their construction as a hybrid of two redundancy reduction techniques—using both prediction and transformation. In modern hybrid codecs, regions can be predicted using inter-picture prediction, and a spatial frequency transform is applied to the refinement regions and the Intra-coded regions. The modern basic structure was first standardized in ITU-T Recommendation H.261 [18], and is used very similarly in its successors MPEG-1 [19], H.262 MPEG-2 [20], H.263 [21], MPEG-4 Part 2 Visual [22], and H.264/AVC [23].

One concept for the exploitation of statistical temporal dependencies that was missing in the first version of H.120 [16] and in [17] was motion-compensated prediction (MCP). MCP dates to the early 1970s [24], and the way it is used in modern video coding standards [18]–[23] was first widely published in [25]. MCP can be motivated as follows. Most changes in video content are typically due to the motion of objects in the depicted scene relative to the imaging plane, and a small amount of motion can result in large differences in the values of the samples in a picture, especially near the edges of objects. Often, predicting an area of the current picture from a region of the previous picture that is displaced by a few samples in spatial location can significantly reduce the need for a refining difference approximation. This use of spatial displacement motion vectors (MVs) to form a prediction is known as motion compensation (MC), and the encoder's search for the best MVs to use is known as motion estimation (ME). The coding of the resulting difference signal for the refinement of the MCP is known as MCP residual coding.
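
As an illustration, a minimal sketch of integer-precision ME by exhaustive block matching follows; it uses the common sum-of-absolute-differences (SAD) cost, assumes 8-bit sample arrays indexed as frame[row][column], and assumes the search window stays inside the reference picture (real encoders use far faster search strategies):

    def sad(cur, ref, bx, by, dx, dy, bs=16):
        # Distortion of predicting the bs x bs block at (bx, by) in the current
        # picture from the block displaced by MV (dx, dy) in the reference picture.
        total = 0
        for y in range(bs):
            for x in range(bs):
                total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
        return total

    def motion_estimate(cur, ref, bx, by, search=8, bs=16):
        best_mv, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, bs)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                cost = sad(cur, ref, bx, by, dx, dy, bs)
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
        return best_mv  # the MCP residual is then the block minus its prediction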

It should be noted that the subsequent improvement of MCP techniques has been the major reason for coding efficiency improvements achieved by modern standards when comparing them from generation to generation. The price for the use of MCP in ever more sophisticated ways is a major increase in complexity requirements. The primary steps forward in MCP that found their way into the H.264/AVC standard were as follows.

• Fractional-sample-accurate MCP [26]. This term refers to the use of spatial displacement MV values that have more than integer precision, thus requiring the use of interpolation when performing MCP. A theoretical motivation for this can be found in [27] and [28]. Intuitive reasons include having a more accurate motion representation and greater flexibility in prediction filtering (as full-sample, half-sample, and quarter-sample interpolators provide different degrees of low-pass filtering which are chosen automatically in the ME process). Half-sample-accuracy MCP was considered even during the design of H.261 but was not included due to the complexity limits of the time. Later, as processing power increased and algorithm designs improved, video codec standards increased the precision of MV support from full-sample to half-sample (in MPEG-1, MPEG-2, and H.263) to quarter-sample (for luma in MPEG-4's advanced simple profile and H.264/AVC) and beyond (with eighth-sample accuracy used for chroma in H.264/AVC).

• MVs over picture boundaries [29], first standardized in H.263. This approach solves the problem of motion representation for samples at the boundary of a picture by extrapolating the reference picture. The most common method is just to replicate the boundary samples for extrapolation.

• Bipredictive MCP [30], i.e., the averaging of two MCP signals. One prediction signal has typically been formed from a picture in the temporal future with the other formed from the past relative to the picture being predicted (hence, it has often been called bidirectional MCP). Bipredictive MCP was first put into a standard in MPEG-1, and it has been present in all succeeding standards. Intuitively, such bipredictive MCP particularly helps when the scene contains uncovered regions or smooth and consistent motion.

• Variable block size MCP [31], i.e., the ability to select the size of the region (ordinarily a rectangular block-shaped region) associated with each MV for MCP. Intuitively, this provides the ability to effectively trade off the accuracy of the motion field representation with the number of bits needed for representing MVs [41].

• Multipicture MCP [36], [37], i.e., MCP using more than just one or two previous decoded pictures. This allows the exploitation of long-term statistical dependencies in video sequences, as found with backgrounds, scene cuts, and sampling aliasing.


Fig. 1. Hybrid video encoder (especially for H.264/AVC).

• Multihypothesis and weighted MCP [32]–[35], i.e., the concept of linearly superimposed MCP signals. This can be exploited in various ways, such as overlapped block MC as in [32] and [33] (which is in H.263 but not H.264/AVC) and conventional bidirectional MCP. The combination of bidirectional MCP, multipicture MCP, and linearly weighted MCP can lead to a unified generalization [34] as found in H.264/AVC. Even the interpolation process of fractional-sample-accurate MCP is a special case of multihypothesis MCP, as it uses a linear superposition of MCP signals from multiple integer MV offsets.

Natural video contains a wide variety of content with different statistical behavior, even from region to region within the same picture. Therefore, a consistent strategy for improving coding efficiency has been to add coding modes to locally adapt the processing for each individual part of each picture. Fig. 1 shows an example encoder for modern video coding standards [18]–[23].

In summary, a hybrid video encoding algorithm typically proceeds as follows. Each picture is split into blocks. The first picture of a video sequence (or a "clean" random access point into a video sequence) is typically coded in Intra mode (which typically uses some prediction from region to region within the picture but has no dependence on other pictures). For all remaining pictures of a sequence or between random access points, typically inter-picture coding modes are used for most blocks. The encoding process for Inter prediction (ME) consists of choosing motion data comprising the selected reference picture and MV to be applied for all samples of each block. The motion and mode decision data, which are transmitted as side information, are used by the encoder and decoder to generate identical Inter prediction signals using MC.

The residual of the Intra or Inter prediction, which is the difference between the original block and its prediction, is transformed by a frequency transform. The transform coefficients are then scaled, quantized, entropy coded, and transmitted together with the prediction side information.

The encoder duplicates the decoder processing so that both will generate identical predictions for subsequent data. Therefore, the quantized transform coefficients are constructed by inverse scaling and are then inverse transformed to duplicate the decoded prediction residual. The residual is then added to the prediction, and the result of that addition may then be fed into a deblocking filter to smooth out block-edge discontinuities induced by the block-wise processing. The final picture (which is also displayed by the decoder) is then stored for the prediction of subsequent encoded pictures. In general, the order of the encoding or decoding processing of pictures often differs from the order in which they arrive from the source, necessitating a distinction between the decoding order and the output order for a decoder.
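
A minimal sketch of this shared reconstruction loop follows, with the spatial transform omitted for brevity and a plain step-size quantizer standing in for the standard's scaling equations (all names are illustrative):

    def reconstruct_block(orig, pred, step):
        residual = [o - p for o, p in zip(orig, pred)]
        levels = [round(r / step) for r in residual]     # quantized coefficients
        decoded_residual = [lv * step for lv in levels]  # inverse scaling
        # 'levels' are entropy coded and transmitted; 'recon' (after optional
        # deblocking) is stored for the prediction of subsequent pictures,
        # identically in the encoder and the decoder.
        recon = [p + r for p, r in zip(pred, decoded_residual)]
        return levels, recon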

The design and operation of an encoder involves the optimization of many decisions to achieve the best possible tradeoff between rate and distortion given the constraints on delay and complexity. There has been a large amount of work on this optimization problem. One particular focus has been on Lagrangian optimization methods [38]–[40]. Some studies have developed advanced encoder optimization strategies with little regard for encoding complexity (e.g., [41]–[51]), while others have focused on how to achieve a reduction in complexity while losing as little as possible in rate-distortion performance.

Above we have described the major technical features of a modern video coder. An example of the effectiveness of these features and the dependence of this effectiveness on video content is shown in Fig. 2. The plot shows performance for a sequence known as Foreman, with heavy object motion and an unstable hand-held moving camera. The sequence was encoded in common intermediate format (CIF) resolution (352×288 in luma with 4:2:0 sampling) at 15 frames/s, using well-optimized H.263 and MPEG-4 Part 2 video encoders (using optimization methods described in [51]).


Fig. 2. Effectiveness of basic technical features.

H.263 and MPEG-4 Part 2 use 8×8 DCT-based residual coding and (as with all other standards starting with H.261) 16×16 prediction mode regions called macroblocks.

Gains in performance can be seen in Fig. 2 when adding various enhanced Inter coding modes to the encoder.

Case 1) The performance achieved by spatial-transform Intra coding only (e.g., as in JPEG coding).
Case 2) Adding Skip mode to form a CR coder.
Case 3) Adding residual difference coding, but with only zero-valued MVs.
Case 4) Adding integer-precision MC with blocks of size 16×16 luma samples.
Case 5) Adding half-sample-precision MC.
Case 6) Allowing some 16×16 regions to be split into four blocks of 8×8 luma samples each for MC.
Case 7) Increasing MV precision to quarter-sample.

The addition of more and more such cases must be done carefully, or the complexity of selecting among them and the amount of coded data necessary to indicate the selection could exceed the benefit of having more choices available. The amount of benefit for each technique will also vary dramatically for different video scene content.

III. VIDEO TRANSMISSION OVER ERROR-PRONE CHANNELS

In many cases, the errors of transmission channels can be efficiently corrected by classical channel coding methods such as forward error correction (FEC) and automatic repeat request (ARQ) or a mixture of them. This is achieved at the cost of reduced throughput and increased delay. Applications that typically fall into this category are broadcast, streaming, and video mail, and most of the problems related to error-prone transmission channels do not affect the design of video codecs for these applications.

However, these channel coding techniques sometimes require too much of a reduction in data throughput from the transmission channel and add too much delay to provide a negligible bit-error and packet-loss rate for some applications. Examples are video conferencing with its demanding delay requirements, slow fading mobile channels, congested Internet routers, and broadcast with varying coverage. Therefore, some amount of data losses or residual errors must often be tolerated.

Fig. 3. Scope of video coding standardization.

However, when MCP is used in a hybrid video codec, data losses can cause the reference pictures stored at the encoder and decoder to differ, in that the encoder's reference storage contains the transmitted video pictures while the decoder's reference storage contains corrupted or concealed content for the parts of the pictures that are affected by the errors. MCP can then cause the error to propagate to many subsequently decoded pictures. Because errors remain visible for much longer than a single picture display period, the resulting artifacts are particularly annoying to viewers. Quick recovery can only be achieved when picture regions are encoded in Intra mode or when Inter prediction is modified to ensure that no reference is made to the parts of the reference pictures that differ.

The bitstream and its transport layer must provide frequent access points at which a decoder can restart its decoding process after some loss or corruption, and it can also be beneficial to separate more important data (such as header information, prediction modes, MVs, and Intra data) from less important data (such as the fine details of the Inter prediction residual representation) in the bitstream so that the more important data can still be decoded when some of the less important data has been lost. Providing greater protection against losses of the more important parts of the data can also be beneficial.

Work in this area often focuses on modifying syntax and encoder operation to minimize error propagation, or improving the decoder's ability to conceal errors. Recently, there has also been some work on changing the basic structure of a low-delay video codec to using distributed coding (reviewed in [52]). Approaches to modify encoder operation either concentrate on the use of Intra coding (e.g., [53]–[57]) or modify MCP in Inter coding (e.g., [58]–[63]) or both (e.g., [37], [64]). Methods to improve error concealment at the decoder have included approaches with and without dedicated side information (e.g., see [65]–[68]).

IV. VIDEO CODING STANDARDS

A typical video processing chain (excluding the transport or storage of the video signal) and the scope of the video coding standardization are depicted in Fig. 3. For all ITU-T and ISO/IEC JTC 1 video coding standards, only the central decoder is standardized. The standard defines a specific bitstream syntax, imposes very limited constraints on the values of that syntax, and defines a limited-scope decoding process. The intent is for every decoder that conforms to the standard to produce similar output when given a bitstream that conforms to the specified constraints. Thus, these video coding standards are written primarily only to ensure interoperability (and syntax capability), not to ensure quality.


This limitation of scope permits maximal freedom to optimize the design of each specific product (balancing compression quality, implementation cost, time to market, etc.). It provides no guarantees of end-to-end reproduction quality, as it allows even crude encoding methods to be considered in conformance with the standard.

V. H.264/AVC VIDEO CODING STANDARD

To address the requirement of flexibility and customizability for various applications, the H.264/AVC [23], [69] design covers a video coding layer (VCL), which is designed to efficiently represent the video content, and a network abstraction layer (NAL), which formats the VCL representation of the video and provides header information to package that data for network transport.

A. H.264/AVC NAL

The NAL is designed to enable simple and effective customization of the use of the VCL for a broad variety of systems. The full degree of customization of the video content to fit the needs of each particular application is outside the scope of the H.264/AVC standard itself, but the design of the NAL anticipates a variety of such mappings.

Some key building blocks of the NAL design are NAL units, parameter sets, and access units. A short description of these concepts is given below, with more detail including error resilience aspects provided in [70] and [71].

1) NAL Units: The coded video data is organized into NAL units, each of which is effectively a packet that contains an integer number of bytes. The first byte of each NAL unit is a header byte that contains an indication of the type of data in the NAL unit, and the remaining bytes contain payload data of the type indicated by the header.

Some systems (e.g., H.320 and H.222.0 MPEG-2 systems) require delivery of the entire or partial stream of NAL units as an ordered stream of bytes or bits. For use in such systems, H.264/AVC specifies a byte stream format, in which each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix, which can be uniquely identified in the byte stream. A finite-state machine prevents accidental emulation of start code prefixes. In other systems (e.g., RTP/IP systems), the coded data is carried in packets that are framed by the system transport protocol, and identification of the boundaries of NAL units within the transport packets can be established without use of start code prefix patterns.
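
For illustration, the start code prefix is the byte pattern 0x000001, and the NAL unit header byte carries a 1-bit forbidden_zero_bit, a 2-bit nal_ref_idc, and a 5-bit nal_unit_type. The following sketch locates NAL unit headers in a byte stream (it deliberately ignores the emulation-prevention mechanism and leading zero bytes for brevity):

    def find_nal_units(stream: bytes):
        units = []
        i = 0
        while i + 3 < len(stream):
            if stream[i:i + 3] == b"\x00\x00\x01":   # start code prefix found
                header = stream[i + 3]
                units.append({
                    "payload_offset": i + 4,
                    "forbidden_zero_bit": header >> 7,
                    "nal_ref_idc": (header >> 5) & 0x3,
                    "nal_unit_type": header & 0x1F,
                })
                i += 4
            else:
                i += 1
        return units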

There are two classes of NAL units, called VCL and non-VCL NAL units. The VCL NAL units contain the data that represents the values of the samples in the video pictures, and the non-VCL NAL units contain all other related information such as parameter sets (important header data that can apply to a large number of VCL NAL units) and supplemental enhancement information (timing information and other supplemental data that may enhance usability of the decoded video signal but are not necessary for decoding the values of the samples in the video pictures).

2) Parameter Sets: A parameter set contains important header information that can apply to a large number of VCL NAL units. There are two types of parameter sets:

Fig. 4. Parameter set use with reliable "out-of-band" parameter set exchange.

• sequence parameter sets, which apply to a series of consecutive coded video pictures;

• picture parameter sets, which apply to the decoding of one or more individual pictures.

Key VCL NAL units for a picture each contain an identifier that refers to the content of the relevant picture parameter set, and each picture parameter set contains an identifier that refers to the relevant sequence parameter set. In this manner, a small amount of data (the identifier) can be used to establish a larger amount of information (the parameter set) without repeating that information within each VCL NAL unit.

The sequence and picture parameter set mechanism decouples the transmission of infrequently changing information from the transmission of coded representations of the values of the samples in the video pictures. This extra robustness for parameter sets is especially important, as the loss of certain syntax elements can have a catastrophic impact on the ability to decode the video.

Sequence and picture parameter sets can be sent well ahead of the VCL NAL units that they apply to, and can be repeated to provide robustness against data loss. In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed "in-band" transmission). In other applications (see Fig. 4) it can be advantageous to convey the parameter sets "out-of-band" using a more reliable transport mechanism.

3) Access Units: The set of VCL and non-VCL NAL units that is associated with a single decoded picture is referred to as an access unit. The access unit contains all macroblocks of the picture, possibly some redundant approximations of some parts of the picture for error resilience purposes (referred to as redundant slices), and other supplemental information associated with the picture.

B. H.264/AVC VCL

As in all prior ITU-T and ISO/IEC JTC 1 video standards since H.261 [18], the VCL design follows the so-called block-based hybrid video coding approach (as depicted in Fig. 1). There is no single coding element in the VCL that provides the majority of the significant improvement in compression efficiency in relation to prior video coding standards. It is rather a plurality of smaller improvements that add up to the significant gain. A more detailed description of the VCL design is given below.

1) Macroblocks, Slices, and Slice Groups: A coded video sequence in H.264/AVC consists of a sequence of coded pictures.


Fig. 5. Subdivision of a picture into slices (when not using FMO).

Each picture is partitioned into fixed-size macroblocks that each contain a rectangular picture area of 16×16 samples for the luma component and the corresponding 8×8 sample regions for each of the two chroma components. Macroblocks are the basic building blocks for which the decoding process is specified. The luma and chroma samples of a macroblock are predicted—either spatially or temporally—and the resulting prediction residual is transmitted using transform coding. Each color component of the residual is subdivided into blocks, each block is transformed using an integer transform, and the transform coefficients are quantized and entropy coded.

The macroblocks of the picture are organized into slices, which represent regions of a given picture that can be decoded independently. Each slice is a sequence of macroblocks that is processed in the order of a raster scan, i.e., a scan from top-left to bottom-right (although they are not necessarily always consecutive in the raster scan, as described below for the flexible macroblock ordering (FMO) feature). A picture may contain one or more slices (for example, as shown in Fig. 5). Each slice is self-contained, in the sense that, given the active sequence and picture parameter sets, its syntax elements can be parsed from the bitstream and the values of the samples in the area of the picture that the slice represents can basically be decoded without use of data from other slices of the picture (provided that all previously decoded reference pictures are identical at encoder and decoder for use in MCP). However, for completely exact decoding, some information from other slices may be needed in order to apply the deblocking filter across slice boundaries. Slices can be used for:

• error resilience, as the partitioning of the picture allows spatial concealment within the picture and as the start of each slice provides a resynchronization point at which the decoding process can be reinitialized;

• creating well-segmented payloads for packets that fit the maximum transfer unit (MTU) size of a network (e.g., the MTU size is 1500 bytes for Ethernet);

• parallel processing, as each slice can be encoded and decoded independently of the other slices of the picture.

The error resilience aspect of slices can be further enhanced (among other uses) through the use of the FMO technique, which modifies the way macroblocks are associated with slices. Using FMO, a picture can be split into many macroblock scanning patterns such as interleaved slices, a dispersed macroblock allocation, one or more "foreground" slice groups and a "leftover" slice group, or a checkerboard mapping. For more details on the use of FMO, see [70]; concealment techniques for FMO are exemplified in [71].

Since each slice of a picture can be decoded independently of the others, no specific ordering of the decoding for the various slices of a picture is strictly necessary. This gives rise to a concept closely related to FMO that can be used for loss robustness and delay reduction, which is arbitrary slice ordering (ASO). When ASO is in use, the slices of a picture can be in any relative order in the bitstream, and when it is not, the slices must be ordered such that the first macroblock in each subsequent slice is increasing in the order of a raster scan within the picture.

Loss robustness can also be enhanced by separating more important data (such as macroblock types and MV values) from less important data (such as inter residual transform coefficient values) and reflecting data dependencies and importance by using separate NAL unit packets for data of different categories. This is referred to as data partitioning.

Further loss robustness can be provided by sending duplicative coded representations of some or all parts of the picture. These are referred to as redundant slices.

2) Slice Types: There are five fundamental slice types.

• I slice: A slice in which all macroblocks of the slice are coded using Intra prediction.

• P slice: In addition to the coding types of the I slice, macroblocks of a P slice can also be coded using Inter prediction with at most one MCP signal per block.

• B slice: In addition to the coding types available in a P slice, macroblocks of a B slice can also be coded using Inter prediction with two MCP signals per prediction block that are combined using a weighted average.

• SP slice: A so-called switching P slice that is coded such that efficient and exact switching between different video streams (or efficient jumping from place to place within a single stream) becomes possible without the large number of bits needed for an I slice.

• SI slice: A so-called switching I slice that allows an exact match with an SP slice for random access or error recovery purposes, while using only Intra prediction.

The first three slice types listed above are very similar to coding methods used in previous standards, with the exception of the use of reference pictures as described below. The other two types are new. For details on the novel concept of SP and SI slices, the reader is referred to [72]; the other slice types are further described below.

3) Intra-Picture Prediction: In all slice-coding types, two primary types of Intra coding are supported: Intra 4×4 and Intra 16×16 prediction. Chroma Intra prediction is the same in both cases. A third type of Intra coding, called I_PCM, is also provided for use in unusual situations.

The Intra 4×4 mode is based on predicting each 4×4 luma block separately and is well suited for coding parts of a picture with significant detail. The Intra 16×16 mode, on the other hand, does prediction and residual coding on the entire 16×16 luma block and is more suited for coding very smooth areas of a picture. In addition to these two types of luma prediction, a separate chroma prediction is conducted.


Fig. 6. Left: Intra 4×4 prediction is conducted for samples a–p using samples A–M. Right: Eight selectable "prediction directions" for Intra 4×4.

In contrast to previous video coding standards (especially H.263 and MPEG-4 Visual), where Intra prediction has been conducted in the transform domain, Intra prediction in H.264/AVC is always conducted in the spatial domain, by referring to neighboring samples of previously decoded blocks that are to the left and/or above the block to be predicted. Since this can result in spatio-temporal error propagation when Inter prediction has been used for neighboring macroblocks, a constrained Intra coding mode can alternatively be selected that allows prediction only from Intra-coded neighboring macroblocks.

In Intra 4×4 mode, each 4×4 luma block is predicted from spatially neighboring samples as illustrated on the left-hand side of Fig. 6. The 16 samples of the 4×4 block, marked a–p, are predicted using position-specific linear combinations of previously decoded samples, marked A–M, from adjacent blocks. The encoder can select either "DC" prediction (called mode 2, where an average value is used to predict the entire block) or one of eight directional prediction types illustrated on the right-hand side of Fig. 6. The directional modes are designed to model object edges at various angles.
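
Three of these predictors (vertical, horizontal, and DC) are simple enough to sketch directly; the following assumes the neighboring samples are all available, with above holding samples A–D and left holding samples I–L from Fig. 6:

    def intra4x4_vertical(above):
        return [above[:] for _ in range(4)]       # mode 0: copy the row above downward

    def intra4x4_horizontal(left):
        return [[left[y]] * 4 for y in range(4)]  # mode 1: copy the left column across

    def intra4x4_dc(above, left):
        dc = (sum(above) + sum(left) + 4) >> 3    # mode 2: rounded average of A-D, I-L
        return [[dc] * 4 for _ in range(4)]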

In Intra 16×16 mode, the whole 16×16 luma component of the macroblock is predicted at once, and only four prediction modes are supported: vertical, horizontal, DC, and plane. The first three are similar to the modes in Intra 4×4 prediction except for increasing the number of samples to reflect the larger block size. Plane prediction uses position-specific linear combinations that effectively model the predicted block as a plane with an approximate fit for the horizontal and vertical variation along the block edges.

The chroma samples of an Intra macroblock are predicted using similar prediction techniques as for the luma component in Intra 16×16 macroblocks.

For the I_PCM Intra macroblock type, no prediction is performed and the raw values of the samples are simply sent without compression. This mode is primarily included for decoder implementation reasons, as it ensures that the number of bits needed for any macroblock will never need to be much larger than the size of an uncompressed macroblock, regardless of the quantization step size and the values of the particular macroblock samples. As a side benefit, it also enables lossless coding of selected regions.

4) Inter-Picture Prediction: Inter-Picture Prediction in P Slices: Various "predictive" or motion-compensated coding types are specified as P macroblock types.

Fig. 7. Segmentations of the macroblock for MC. Top: segmentation of macroblocks; bottom: segmentation of 8×8 partitions.

Fig. 8. Multipicture MCP. In addition to the MV, reference indexes (Δ) are transmitted. The concept is similarly extended for B slices.

P macroblocks can be partitioned into smaller regions for MCP with luma block sizes of 16×16, 16×8, 8×16, and 8×8 samples. When 8×8 macroblock partitioning is chosen, an additional syntax element is transmitted for each 8×8 partition, which specifies whether the 8×8 partition is further partitioned into smaller regions of 8×4, 4×8, or 4×4 luma samples and corresponding chroma samples (see Fig. 7). The prediction signal for each predictive-coded luma block is obtained by MC, which is specified by a translational MV and a picture reference index. The syntax allows MVs to point over picture boundaries.

The accuracy of MC is in units of one-quarter of the horizontal or vertical distance between luma samples. If the MV points to an integer-sample position, the prediction signal consists of the corresponding samples of the reference picture; otherwise, the corresponding sample is obtained using interpolation. The prediction values at half-sample positions are obtained by applying a one-dimensional six-tap finite impulse response (FIR) filter horizontally and/or vertically. Prediction values at quarter-sample positions are generated by averaging two samples at integer- and half-sample positions. For further analysis, refer to [73].
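
The six-tap half-sample filter has coefficients (1, −5, 20, 20, −5, 1); a minimal sketch of the two interpolation steps for 8-bit samples follows (positions that require filtering in both dimensions keep higher intermediate precision in the standard, which is omitted here):

    def half_sample(s0, s1, s2, s3, s4, s5):
        # Six-tap FIR across six consecutive integer-position samples,
        # with rounding, scaling by 1/32, and clipping to the 8-bit range.
        acc = s0 - 5 * s1 + 20 * s2 + 20 * s3 - 5 * s4 + s5
        return min(255, max(0, (acc + 16) >> 5))

    def quarter_sample(a, b):
        # Rounded average of two neighboring integer-/half-sample values.
        return (a + b + 1) >> 1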

The MV values are differentially coded using either median or directional prediction from neighboring blocks. No MV value prediction (or any other form of prediction) takes place across slice boundaries.
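
A sketch of the component-wise median predictor, taking the MVs of the three neighboring blocks (typically left, above, and above-right) as (x, y) pairs:

    def median_mv(mv_a, mv_b, mv_c):
        def med3(x, y, z):
            return x + y + z - min(x, y, z) - max(x, y, z)
        return (med3(mv_a[0], mv_b[0], mv_c[0]),
                med3(mv_a[1], mv_b[1], mv_c[1]))

    # Only the difference between the actual MV and this predictor is coded.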

The syntax supports multipicture MCP [36], [37]. That is, more than one previously decoded picture can be used as a reference for MCP. Fig. 8 illustrates the concept. Previously decoded pictures are stored in a decoded picture buffer (DPB) as directed by the encoder, and a DPB reference index is associated with each motion-compensated 16×16, 16×8, 8×16, or 8×8 luma block. MCP for regions smaller than 8×8 uses the same reference index for predicting all blocks in an 8×8 region.

A P macroblock can also be coded in the so-called P Skip mode. For this coding mode, neither a quantized prediction error signal nor an MV with a reference index is sent.


The reconstructed signal is obtained using only a prediction signal like that of a P 16×16 macroblock that references the picture located at index 0 in the list (referred to as list 0) of pictures in the DPB. The MV used for reconstructing the P Skip macroblock is similar to the MV predictor for the 16×16 block. The useful effect of this P Skip coding type is that large areas with no change or constant motion (like slow panning) can be represented with very few bits.

Inter-Picture Prediction in B Slices: In comparison to prior video coding standards, the concept of B slices is generalized in H.264/AVC. This extension refers back to [32]–[34] and is further studied in [74]. For example, other pictures can use reference pictures containing B slices for MCP, depending on whether the encoder has selected to indicate that the B picture can be used for reference. Thus, the substantial difference between B and P slices is that B slices are coded in a manner in which some macroblocks or blocks may use a weighted average of two distinct MCP values for building the prediction signal. B slices use two distinct lists of reference pictures in the DPB, which are referred to as the first (list 0) and second (list 1) reference picture lists, respectively.

B slices use a similar macroblock partitioning as P slices. Besides the P 16×16, P 16×8, P 8×16, P 8×8, and the Intra coding types, bipredictive prediction and another type of prediction called Direct prediction are provided. For each 16×16, 16×8, 8×16, and 8×8 partition, the prediction method (list 0, list 1, bipredictive) can be chosen separately. An 8×8 partition of a B macroblock can also be coded in Direct mode. If no prediction error signal is transmitted for a Direct macroblock mode, it is also referred to as B Skip mode and can be coded very efficiently, similar to the P Skip mode in P slices. The MV coding is similar to that of P slices, with the appropriate modifications because neighboring blocks may be coded using different prediction modes.

Weighted Prediction in P and B Slices: In previous standards, biprediction has typically been performed with a simple (1/2, 1/2) averaging of the two prediction signals, and the prediction in the so-called P macroblock types has not used weighting. However, in H.264/AVC, an encoder can specify scaling weights and offsets to be used for each prediction signal in the P and B macroblocks of a slice. The weighting and offset values can be inferred from temporal relationships or can be specified explicitly. It is even allowed for different weights and offsets to be specified within the same slice for performing MCP using the same particular reference picture.
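
A minimal sketch of this weighting follows, with clipping omitted and log_wd denoting an assumed weight-denominator exponent (the standard's full integer arithmetic is simplified here):

    def weighted_pred(p, w, o, log_wd=5):
        # Single-list prediction: scale by weight w, round, then add offset o.
        return ((p * w + (1 << (log_wd - 1))) >> log_wd) + o

    def weighted_bipred(p0, w0, o0, p1, w1, o1, log_wd=5):
        # Biprediction: weighted sum of the two MCP signals plus averaged offsets.
        acc = p0 * w0 + p1 * w1 + (1 << log_wd)
        return (acc >> (log_wd + 1)) + ((o0 + o1 + 1) >> 1)

Setting w0 = w1 = 1 with log_wd = 0 and zero offsets reduces to the conventional (1/2, 1/2) average.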

5) Transform, Scaling, and Quantization: Similar to previous video coding standards, H.264/AVC uses spatial transform coding of the prediction residual. However, in H.264/AVC, the transformation is applied to 4×4 blocks (instead of the larger 8×8 blocks used in previous standards), and instead of providing a theoretical inverse DCT formula to be approximated by each implementer within specified tolerances, a separable integer transform with similar properties to a 4×4 DCT is used. Its basic matrix is

    H = [ 1  1  1  1 ]
        [ 2  1 -1 -2 ]
        [ 1 -1 -1  1 ]
        [ 1 -2  2 -1 ]
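
Applying this transform to a 4×4 residual block X amounts to computing Y = H X Hᵀ with plain integer arithmetic, as in the following sketch (the normalizing scaling is folded into quantization in the standard and omitted here):

    H = [[1,  1,  1,  1],
         [2,  1, -1, -2],
         [1, -1, -1,  1],
         [1, -2,  2, -1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]

    def forward_transform_4x4(x):
        ht = [list(col) for col in zip(*H)]   # H transposed
        return matmul(matmul(H, x), ht)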

The transform coding process is similar to that in previous standards, but since the inverse transform is defined by very simple exact integer operations, inverse-transform mismatches are avoided and decoding complexity is minimized. There are several reasons for using a smaller transform size (4×4) than was used in prior standards (8×8).

• One of the main improvements of the present standard is the improved prediction process both for Inter and Intra. Consequently, the residual signal has less spatial correlation. This generally means that the transform has less to offer concerning decorrelation, so a 4×4 transform is essentially as efficient.

• With similar objective compression capability, the smaller 4×4 transform has visual benefits, resulting in less noise around edges (referred to as "mosquito noise" or "ringing" artifacts).

• The smaller transform requires less computation and a smaller processing word length.

For the luma component in the Intra 16×16 mode and for the chroma components in all Intra macroblocks, the DC coefficients of the 4×4 transform blocks undergo a second transform, with the result that the lowest-frequency transform basis functions cover the entire macroblock. This additional transform is 4×4 for the processing of the luma component in Intra 16×16 mode and is 2×2 for the processing of each chroma component in all Intra modes. Extending the length of the lowest-frequency basis functions by applying such a secondary transform tends to improve compression performance for very smooth regions.

A quantization parameter (QP) is used for determining the quantization of transform coefficients in H.264/AVC. It can take on 52 values. The quantization step size is controlled logarithmically by QP rather than linearly as in previous standards, in a manner designed to reduce decoding complexity and enhance bit rate control capability. Each increase of six in QP causes a doubling of the quantization step size, so each increase of one in QP increases the step size by approximately 12%. (Often a change of step size by approximately 12% also means roughly a reduction of bit rate by approximately 12%.)
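
In other words, the step size grows as roughly 2^(QP/6), as in the following sketch (the base value at QP = 0 is an assumption here, not quoted from the standard's tables):

    QSTEP_BASE = 0.625   # assumed step size at QP = 0

    def q_step(qp):
        return QSTEP_BASE * 2 ** (qp / 6)

    # q_step(qp + 6) == 2 * q_step(qp), and q_step(qp + 1) is about 12% larger.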

The quantized transform coefficients of a block generally are scanned in a zigzag fashion and transmitted using entropy coding. The 2×2 DC coefficients of the chroma component are scanned in raster-scan order.

All inverse transform operations in H.264/AVC can be implemented using only additions, subtractions, and bit-shifting operations on 16-b integer values, and the scaling can be done using only 16 b as well. Similarly, only 16-b memory accesses are needed for a good implementation of the forward transform and quantization processes in the encoder. For more information, see [75].

6) Entropy Coding: In H.264/AVC, two alternatives for entropy coding are supported. These are called context-adaptive variable-length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC). CABAC has higher complexity than CAVLC, but has better coding efficiency.

In both of these modes, many syntax elements are coded using a single infinite-extent codeword set referred to as an Exp-Golomb code.


Thus, instead of designing a different VLC table for each syntax element, only the mapping to the single codeword table is customized to the data statistics. The Exp-Golomb code has a simple and regular structure.
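
That structure is easy to state: a codeword for the nonnegative integer N consists of M leading zero bits followed by the (M + 1)-bit binary representation of N + 1, where M = floor(log2(N + 1)). A sketch of the order-0 encoder:

    def exp_golomb_encode(n: int) -> str:
        bits = bin(n + 1)[2:]                 # binary representation of n + 1
        return "0" * (len(bits) - 1) + bits   # prefix of len(bits) - 1 zeros

    # exp_golomb_encode(0) -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...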

When using CAVLC, the quantized transform coefficients are coded using VLC tables that are switched depending on the values of previous syntax elements. Since the VLC tables are context conditional, the coding efficiency is better than for schemes using a single VLC table, such as the simple "run-level" or "run-level-last" coding found in previous standards. More details can be found in [23] and [69].

The efficiency can be improved further using CABAC [76]. CABAC not only uses context-conditional probability estimates, but adjusts its probability estimates to adapt to nonstationary statistical behavior. Its arithmetic coding also enables the use of a noninteger number of bits to encode each symbol of the source alphabet (which can be especially beneficial when the source symbol probabilities are highly skewed). The arithmetic coding core engine and its associated probability estimation use low-complexity multiplication-free operations involving only shifts and table lookups. Compared to CAVLC, CABAC typically reduces the bit rate 10%–15% for the same quality.

7) In-Loop Deblocking Filter: One annoying characteristic of block-based coding is the production of visible block artifacts, especially at low bit rates. Block edges are typically predicted by MCP with less accuracy than interior samples, and block transforms also produce block edge discontinuities. Blocking is generally considered to be one of the most visible artifacts with the present compression methods. For this reason, H.264/AVC defines an adaptive in-loop deblocking filter. A detailed description of the deblocking filter can be found in [77].

The filter reduces blockiness while basically retaining the sharpness of the true edges in the scene. Consequently, the subjective quality is significantly improved. The filter typically reduces bit rate by 5%–10% for the same objective quality as the nonfiltered video, and improves subjective quality even more. Fig. 9 illustrates the visual effect.
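The adaptivity principle can be sketched as follows: samples across a block edge are filtered only when the local differences are small enough to look like a coding artifact rather than a real edge, with thresholds that grow with QP. In the sketch below, alpha() and beta() stand for the QP-dependent threshold tables of the design described in [77] (their values are not reproduced here):

```c
#include <stdlib.h>

/* Sketch: edge-filtering decision of an adaptive deblocking filter,
 * in the spirit of the design in [77]. p1,p0 | q0,q1 are the samples
 * on either side of a block edge. alpha() and beta() stand for
 * QP-dependent threshold tables; larger QP gives larger thresholds,
 * i.e., more aggressive filtering. */
extern int alpha(int qp);
extern int beta(int qp);

int edge_should_be_filtered(int p1, int p0, int q0, int q1, int qp)
{
    return abs(p0 - q0) < alpha(qp) &&
           abs(p1 - p0) < beta(qp)  &&
           abs(q1 - q0) < beta(qp);
}
```

When these conditions fail, the discontinuity is treated as a true scene edge and left untouched, which is how the filter preserves sharpness.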

8) Adaptive Frame/Field Coding Operation: Interlaced frames often show different statistical properties than progressive frames. H.264/AVC allows the following interlace-specific coding methods:

• frame mode: combining the two fields into a frame and coding the entire frame as a picture;

• field mode: not combining the two fields and instead coding each single field as a separate picture;

• macroblock-adaptive frame/field mode (MBAFF): coding the entire frame as a picture, but enabling the selection of individual pairs of vertically adjacent macroblocks within the picture to be split into fields for prediction and residual coding.

The choice between the three options can be made adaptively for each frame in a sequence. Choosing just between the first two options is referred to as picture-adaptive frame/field (PAFF) coding. When a picture is a single field, each field is partitioned into macroblocks and is coded in a manner very similar to a frame, except that MCP uses reference fields rather than reference frames, the zigzag scan for transform coefficients is different, and the strongest deblocking strength is not used for filtering across horizontal edges of macroblocks in fields, because the field rows are spatially twice as far apart as frame rows (effectively lengthening the filter).

For MBAFF coding, the frame/field encoding decision can also be made for each vertical pair of macroblocks in a frame (a 16×32 luma region). For a macroblock pair that is coded in frame mode, each macroblock contains lines from both fields. For a field mode macroblock pair, one macroblock contains top field lines and the other contains bottom field lines. Each macroblock of a field macroblock pair is processed in essentially the same way as a macroblock within a field in PAFF coding. Note that, unlike in MPEG-2, the MBAFF frame/field decision is made at a macroblock pair level rather than within the macroblock level. This keeps the basic macroblock processing structure the same for each prediction or residual coding operation, and permits field mode MCP block sizes as large as an entire macroblock.

During the development of the H.264/AVC standard, for key ITU-R BT.601 [8] resolution sequences chosen as representative for testing, PAFF coding was reported to reduce bit rates roughly 15%–20% over frame-only coding for sequences like “Canoa,” “Rugby,” etc.; and MBAFF coding was reported to reduce bit rates roughly 15% over PAFF for sequences like “Mobile & Calendar” and “News.”

9) Hypothetical Reference Decoder: A key benefit provided by a standard is the assurance that all decoders that conform to the standard will be able to decode any conforming compressed video bitstream (given the appropriate profile and level capabilities as discussed below). To achieve that, it is not sufficient to just specify the syntax of the data and how to interpret it. It is also important to constrain how fast the bitstream data can be fed to a decoder and how much buffering of the bitstream and decoded pictures is required to build a decoder. Specifying input and output buffer models and developing an implementation-independent idealized model of a decoder achieves this. That receiver model is also called a hypothetical reference decoder (HRD) (see [78]).

The H.264/AVC HRD specifies operation of an idealized decoder with two buffers having specified capacity constraints: the coded picture buffer (CPB) and the DPB. The CPB models the arrival and removal timing of the coded bits and the DPB models the storage for decoded pictures. The HRD design is similar in spirit to that of MPEG-2, but is more flexible for sending video at a variety of bit rates and without excessive delay, and it provides flexible DPB management for highly generalized multipicture buffering.
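The CPB constraint behaves like a leaky bucket: bits arrive at the channel rate, whole coded pictures are removed at their decoding times, and a conforming stream must never underflow or overflow the buffer. A deliberately simplified sketch (variable names are ours; this illustrates the idea, not the normative HRD arithmetic):

```c
/* Sketch: simplified leaky-bucket check in the spirit of a CPB
 * model. Bits arrive at a constant rate; each picture's bits are
 * removed instantaneously at its removal time. A conforming stream
 * keeps the buffer fullness within [0, cpb_size] at all times. */
int cpb_conforms(const double removal_time[], const long pic_bits[],
                 int n_pics, double bit_rate, long cpb_size,
                 double initial_delay)
{
    double fullness = bit_rate * initial_delay; /* bits buffered at start */
    double t = removal_time[0];
    for (int i = 0; i < n_pics; i++) {
        fullness += bit_rate * (removal_time[i] - t); /* bit arrivals */
        t = removal_time[i];
        if (fullness > cpb_size)
            fullness = cpb_size;   /* or flag overflow, depending on mode */
        if (fullness < pic_bits[i])
            return 0;              /* underflow: picture not yet complete */
        fullness -= pic_bits[i];   /* instantaneous removal for decoding */
    }
    return 1;
}
```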

10) Profiles and Levels: Profiles and levels specify conformance points to facilitate interoperability for various applications. Ordinarily a profile defines a syntax that can be used in generating a conforming bitstream, whereas a level places constraints on the values of key parameters (such as maximum bit rate, buffering capacity, or picture resolution).

All decoders conforming to a specific profile must support all features in that profile. Encoders are not required to make use of any particular set of features supported in a profile but must provide conforming bitstreams, i.e., bitstreams that can be decoded by conforming decoders. In H.264/AVC, three profiles are defined. These are the Baseline, Main, and Extended profiles.

Fig. 9. Performance of the deblocking filter for highly compressed pictures. Top: without deblocking filter; bottom: with deblocking filter.

The features of the H.264/AVC design can be segmented into the following five elemental sets.

• Set 0 (basic features for efficiency, robustness, and flexibility): I and P slices, CAVLC, and other basics.

• Set 1 (enhanced robustness/flexibility features): FMO, ASO, and redundant slices.

• Set 2 (further enhanced robustness/flexibility features): SP/SI slices and slice data partitioning.

• Set 3 (enhanced coding efficiency features): B slices, weighted prediction, field coding, and macroblock-adaptive frame/field coding.

• Set 4 (a further coding efficiency feature): CABAC.

The Baseline profile, which emphasizes coding efficiency and robustness with low computational complexity, supports the features of sets 0 and 1. The Main profile, which emphasizes primarily coding efficiency alone, supports the features of sets 0, 3, and 4. The Extended profile, which emphasizes robustness and flexibility with high coding efficiency, supports the features of sets 0, 1, 2, and 3 (all features except CABAC).
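Expressed as feature-set masks, these profile definitions reduce to a simple containment check, as the following sketch shows (the enumeration and function names are ours); the decodability consequences are discussed next:

```c
/* Sketch: the profile/feature-set relationships described above,
 * expressed as bitmasks over the five elemental sets. */
enum {
    SET0 = 1 << 0,  /* basics: I/P slices, CAVLC, ...         */
    SET1 = 1 << 1,  /* FMO, ASO, redundant slices             */
    SET2 = 1 << 2,  /* SP/SI slices, slice data partitioning  */
    SET3 = 1 << 3,  /* B slices, weighted pred., field coding */
    SET4 = 1 << 4   /* CABAC                                  */
};

static const unsigned BASELINE = SET0 | SET1;
static const unsigned MAIN     = SET0 | SET3 | SET4;
static const unsigned EXTENDED = SET0 | SET1 | SET2 | SET3;

/* A decoder for 'profile' can decode a bitstream that uses
 * 'used_features' only if every used set is supported. */
int can_decode(unsigned profile, unsigned used_features)
{
    return (used_features & ~profile) == 0;
}
```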

Since the Main profile does not support the FMO, ASO, and redundant slice features of set 1, some bitstreams that are decodable by a Baseline profile decoder are not decodable by a Main profile decoder. Similarly, because of noncommonality for sets 3 and 4, some bitstreams that are decodable by a Main profile decoder are not decodable by an Extended profile decoder and vice versa. To address this issue, flags in the sequence parameter set are used to indicate which profiles can decode each video sequence.

In H.264/AVC, the same set of levels is used with all profiles, and individual implementations may support a different level for each supported profile. Fifteen levels are defined, specifying upper limits for picture size (from 99 to 36 864 macroblocks per picture), decoder-processing rates (from 1485 to 983 040 macroblocks per second), CPB size, DPB size, bit rate (from 64 kb/s to 240 Mb/s), etc.
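In implementation terms, a level is simply a row of numeric upper limits. The sketch below uses only the endpoint values quoted above; the thirteen intermediate levels and the remaining limit fields (CPB size, DPB size, etc.) are omitted:

```c
/* Sketch: level conformance as a table of upper limits. Only the
 * lowest and highest of the fifteen levels are shown, using the
 * endpoint values quoted in the text. */
typedef struct {
    const char *name;
    long max_mbs_per_pic;   /* picture size in macroblocks */
    long max_mbs_per_sec;   /* decoder processing rate     */
    long max_bit_rate;      /* in bits per second          */
} LevelLimits;

static const LevelLimits levels[] = {
    { "lowest level",     99,   1485,     64000 },
    /* ... thirteen intermediate levels ... */
    { "highest level", 36864, 983040, 240000000 },
};

int fits_level(const LevelLimits *lv, long mbs_per_pic,
               long mbs_per_sec, long bit_rate)
{
    return mbs_per_pic <= lv->max_mbs_per_pic &&
           mbs_per_sec <= lv->max_mbs_per_sec &&
           bit_rate    <= lv->max_bit_rate;
}
```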

C. Performance Comparisons

To illustrate the performance gains that can be achieved when using H.264/AVC, we report the results of an experiment targeting video streaming applications (one of several application experiments reported in [51]). The measure of fidelity is luma peak signal-to-noise ratio (PSNR), which is the most widely used such objective video quality measure:

$$\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\mathrm{MSE}},$$

where MSE is the mean squared error between the original and the corresponding decoded sample values.
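A direct implementation of this measure for 8-bit samples (peak value 255) might look as follows (a sketch; the function name is ours):

```c
#include <math.h>
#include <stddef.h>

/* Luma PSNR in dB for 8-bit samples (peak value 255), computed
 * from the mean squared error between the original and decoded
 * luma planes. */
double luma_psnr(const unsigned char *orig, const unsigned char *dec,
                 size_t n_samples)
{
    double mse = 0.0;
    for (size_t i = 0; i < n_samples; i++) {
        double d = (double)orig[i] - (double)dec[i];
        mse += d * d;
    }
    mse /= (double)n_samples;
    if (mse == 0.0)
        return INFINITY; /* identical pictures */
    return 10.0 * log10(255.0 * 255.0 / mse);
}
```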

The four codecs compared use bitstreams conforming to the following standards:

• H.262 MPEG-2 Visual, Main Profile (MPEG-2);

• H.263, High Latency Profile (HLP);

• MPEG-4 Visual, Advanced Simple Profile (ASP);

• H.264/AVC, Main Profile (MP).

Such applications generally support low to medium bit rates and picture resolutions, with quarter-CIF (QCIF) (176×144) resolution at 10–256 kb/s and CIF (352×288) resolution at 128–1024 kb/s being common. The set of test sequences for this comparison consists of four QCIF (10 and 15 Hz) and four CIF (15 and 30 Hz) sequences. See details and further test results in [51].

Fig. 10. PSNR-rate curves for the test sequence “Tempete” in video streaming applications.

Fig. 10 shows PSNR versus bit rate curves for the sequence “Tempete.” For this sequence and for all others in the test set, H.264/AVC significantly outperforms the other codecs.

VI. CONCLUSIONS AND FURTHER DEVELOPMENTS

H.264/AVC has been developed and standardized collaboratively by both the ITU-T VCEG and ISO/IEC MPEG organizations. H.264/AVC represents a number of advances in standard video coding technology, in terms of coding efficiency improvement, error/loss robustness enhancement, and flexibility for effective use over a broad variety of network types and application domains. Its VCL design is based on conventional block-based motion-compensated hybrid video coding concepts, but with some important differences relative to prior standards, which include:

• enhanced motion prediction capability;

• use of a small block-size exact-match transform;

• adaptive in-loop deblocking filter;

• enhanced entropy coding methods.

When used well together, the features of the new design provide approximately a 50% bit rate savings for equivalent perceptual quality relative to the performance of prior standards (especially for higher latency applications which allow some use of reverse temporal prediction). The performance of the H.264/AVC compliant encoder in experiments reported here and elsewhere (e.g., in [51]) clearly demonstrates the potential importance of this standard in future applications of video broadcast and streaming as well as interactive video coding.2

2 Further information and documents of the project are available at http://ftp3.itu.int/av-arch in the directories video-site and jvt-site.

Since the completion of the first version of the H.264/AVC standard, the JVT experts group has done further work to extend the capabilities of H.264/AVC with important new enhancements known as the Fidelity Range Extensions (FRExt), including four new profiles (the High, High 10, High 4:2:2, and High 4:4:4 profiles). The FRExt enhancements could not be included in the scope of this paper due to the scheduling of the publication process and the need for brevity.

ACKNOWLEDGMENT

The authors would like to thank the experts of ITU-T VCEG, ISO/IEC MPEG, and the ITU-T/ISO/IEC Joint Video Team for their contributions. The authors would especially like to thank G. Bjøntegaard, A. Joch, F. Kossentini, A. Luthra, T. Hinz, D. Marpe, H. Schwarz, and D. Zier.

REFERENCES

[1] “Generic coding of moving pictures and associated audio information—Part 1: Systems,” Int. Telecommun. Union-Telecommun. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, Recommendation H.222.0 and ISO/IEC 13 818-1 (MPEG-2 Systems), Nov. 1994.

[2] “Narrow-band visual telephone systems and terminal equipment,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.320, 1999.


[3] “Packet-based multimedia communications systems,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.323, 1998.

[4] “Terminal for low bit rate multimedia communication,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.324, 1996.

[5] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler, “SIP: Session Initiation Protocol,” Internet Eng. Task Force (IETF), Request for Comments (RFC) 3261, 2002.

[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A transport protocol for real-time applications,” Internet Eng. Task Force (IETF), Request for Comments (RFC) 1889, 1996.

[7] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., pt. 4, pp. 142–163, 1959.

[8] “Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios,” Int. Telecommun. Union-Radiocommun. (ITU-R), Recommendation BT.601-5, 1995.

[9] H. Enomoto and K. Shibata, “Features of Hadamard transformed television signal,” presented at the Nat. Conf. IECE Jpn., 1965, Paper 881.

[10] H. C. Andrews and W. K. Pratt, “Fourier transform coding of images,” in Proc. Hawaii Int. Conf. System Sciences, 1968, pp. 677–679.

[11] “Digital compression and coding of continuous-tone still images,” Int. Telecommun. Union-Telecommun. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) Joint Tech. Comm. (JTC) 1, Recommendation T.81 and ISO/IEC 10 918-1 (JPEG), Sep. 1992.

[12] N. Ahmed, T. Natarajan, and K. R. Rao, “On image processing and a discrete cosine transform,” IEEE Trans. Comput., vol. C-23, no. 1, pp. 90–93, Jan. 1974.

[13] R. D. Kell, “Improvements relating to electric picture transmission systems,” British Patent 341 811, 1929.

[14] C. Reader, “History of MPEG video compression—Ver. 4.0,” Joint Video Team (JVT), JVT-E066, Oct. 2002.

[15] F. W. Mounts, “A video encoding system with conditional picture element replenishment,” Bell Syst. Tech. J., vol. 48, no. 7, pp. 2545–2554, Sep. 1969.

[16] “Codec for videoconferencing using primary digital group transmission,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.120, version 1, 1984; version 2, 1988; version 3, 1993.

[17] A. Habibi, “Hybrid coding of pictorial data,” IEEE Trans. Commun., vol. COM-22, no. 5, pp. 614–624, May 1974.

[18] “Video codec for audiovisual services at p × 64 kbit/s,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.261, version 1, 1990; version 2, 1993.

[19] “Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—Part 2: Video,” Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, ISO/IEC 11 172-2 (MPEG-1), Mar. 1993.

[20] “Generic coding of moving pictures and associated audio information—Part 2: Video,” Int. Telecommun. Union-Telecommun. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, Recommendation H.262 and ISO/IEC 13 818-2 (MPEG-2 Video), Nov. 1994.

[21] “Video coding for low bit rate communication,” Int. Telecommun. Union-Telecommun. (ITU-T), Recommendation H.263, version 1, 1995; version 2, 1998; version 3, 2000.

[22] “Coding of audio-visual objects—Part 2: Visual,” Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, ISO/IEC 14 496-2 (MPEG-4 Visual version 1), 1999–2003.

[23] “Advanced video coding for generic audiovisual services,” Int. Telecommun. Union-Telecommun. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, Recommendation H.264 and ISO/IEC 14 496-10 (MPEG-4 AVC), 2003.

[24] Y. Taki, M. Hatori, and S. Tanaka, “Interframe coding that follows the motion,” in Proc. Institute of Electronics and Communication Engineers Jpn. Annu. Conv. (IECEJ), 1974, p. 1263.

[25] J. R. Jain and A. K. Jain, “Displacement measurement and its application in interframe image coding,” IEEE Trans. Commun., vol. COM-29, no. 12, pp. 1799–1808, Dec. 1981.

[26] S. Brofferio and F. Rocca, “Interframe redundancy reduction of video signals generated by translating objects,” IEEE Trans. Commun., vol. COM-25, pp. 448–455, Apr. 1977.

[27] B. Girod, “The efficiency of motion-compensating prediction for hybrid coding of video sequences,” IEEE J. Sel. Areas Commun., vol. 5, no. 7, pp. 1140–1154, Aug. 1987.

[28] B. Girod, “Motion-compensating prediction with fractional-pel accuracy,” IEEE Trans. Commun., vol. 41, no. 4, pp. 604–612, Apr. 1993.

[29] G. J. Sullivan and R. L. Baker, “Motion compensation for video compression using control grid interpolation,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1991, pp. 2713–2716.

[30] T. Hidaka, “Description of the proposing algorithm and its score for moving image (A part of the proposal package),” Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, ISO/IEC JTC 1/SC 2/WG 8 MPEG 89/188, Oct. 1989.

[31] F. Giorda and A. Racciu, “Bandwidth reduction of video signals via shift vector transmission,” IEEE Trans. Commun., vol. 23, no. 9, pp. 1002–1004, Sep. 1975.

[32] G. J. Sullivan, “Multi-hypothesis motion compensation for low bit-rate video coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1993, pp. 437–440.

[33] M. T. Orchard and G. J. Sullivan, “Overlapped block motion compensation: An estimation-theoretic approach,” IEEE Trans. Image Process., vol. 3, no. 5, pp. 693–699, Sep. 1994.

[34] M. Flierl, T. Wiegand, and B. Girod, “A locally optimal design algorithm for block-based multi-hypothesis motion-compensated prediction,” in Proc. IEEE Data Compression Conf. (DCC), 1998, pp. 239–248.

[35] B. Girod, “Efficiency analysis of multi-hypothesis motion-compensated prediction,” IEEE Trans. Image Process., vol. 9, no. 2, pp. 173–183, Feb. 2000.

[36] T. Wiegand, X. Zhang, and B. Girod, “Long-term memory motion-compensated prediction,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 1, pp. 70–84, Feb. 1999.

[37] T. Wiegand and B. Girod, Multi-Frame Motion-Compensated Prediction for Video Transmission. Norwell, MA: Kluwer, 2001.

[38] H. Everett, “Generalized Lagrange multiplier method for solving problems of optimum allocation of resources,” Oper. Res., vol. 11, pp. 399–417, 1963.

[39] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of quantizers,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, pp. 1445–1453, Sep. 1988.

[40] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 31–42, Jan. 1989.

[41] G. J. Sullivan and R. L. Baker, “Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks,” in Proc. IEEE Global Telecommunications Conf. (GLOBECOM), 1991, pp. 85–90.

[42] S.-W. Wu and A. Gersho, “Enhanced video compression with standardized bitstream syntax,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 1993, pp. 103–106.

[43] K. Ramchandran, A. Ortega, and M. Vetterli, “Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders,” IEEE Trans. Image Process., vol. 3, no. 5, pp. 533–545, Sep. 1994.

[44] A. Ortega, K. Ramchandran, and M. Vetterli, “Optimal trellis-based buffered compression and fast approximations,” IEEE Trans. Image Process., vol. 3, no. 1, pp. 26–40, Jan. 1994.

[45] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, “Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 182–190, Apr. 1996.

[46] M. C. Chen and A. N. Willson, “Design and optimization of a differentially coded variable block size motion compensation system,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, 1996, pp. 259–262.

[47] G. M. Schuster and A. K. Katsaggelos, “A video compression scheme with optimal bit allocation among segmentation, motion, and residual error,” IEEE Trans. Image Process., vol. 6, pp. 1487–1502, Nov. 1997.

[48] M. C. Chen and A. N. Willson, “Rate-distortion optimal motion estimation algorithms for motion-compensated transform video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 2, pp. 147–158, Apr. 1998.

[49] A. Ortega and K. Ramchandran, “Rate-distortion methods for image and video compression: An overview,” IEEE Signal Process. Mag., pp. 23–50, Nov. 1998.

[50] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Signal Process. Mag., vol. 15, pp. 74–90, Nov. 1998.


[51] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688–703, Jul. 2003.

[52] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,” Proc. IEEE, vol. 93, no. 1, pp. 71–83, Jan. 2005.

[53] E. Steinbach, N. Färber, and B. Girod, “Standard compatible extension of H.263 for robust video transmission in mobile environments,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 6, pp. 872–881, Dec. 1997.

[54] B. Girod and N. Färber, “Feedback-based error control for mobile video transmission,” Proc. IEEE, vol. 87, no. 10, pp. 1707–1723, Oct. 1999.

[55] Q. F. Zhu and L. Kerofsky, “Joint source coding, transport processing, and error concealment for H.323-based packet video,” Proc. SPIE, Visual Commun. Image Process., vol. 3653, pp. 52–62, Jan. 1999.

[56] R. O. Hinds, T. N. Pappas, and J. S. Lim, “Joint block-based video source/channel coding for packet-switched networks,” in Proc. SPIE, Visual Commun. Image Process., vol. 3309, Jan. 1998, pp. 124–133.

[57] R. Zhang, S. L. Regunathan, and K. Rose, “Video coding with optimal Inter/Intra mode switching for packet loss resilience,” IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 966–976, Jun. 2000.

[58] G. Bjøntegaard, “An error resilience method based on back channel signaling and FEC,” Telenor R&D, ITU-T, SG15/WP15/1, LBC-95-033, Jan. 1996. Also submitted to ISO/IEC JTC 1/SC 29/WG 11 as MPEG96/M0616.

[59] S. Fukunaga, T. Nakai, and H. Inoue, “Error-resilient video coding by dynamic replacing of reference pictures,” in IEEE Global Telecommunications Conf. (GLOBECOM), vol. 3, 1996, pp. 1503–1508.

[60] Y. Tomita, T. Kimura, and T. Ichikawa, “Error resilient modified inter-frame coding system for limited reference picture memories,” in Proc. Picture Coding Symp., 1997, pp. 743–748.

[61] M. Budagavi and J. D. Gibson, “Multiframe block motion compensated video coding for wireless channels,” in Proc. Asilomar Conf. Signals, Systems, and Computers, vol. 2, 1996, pp. 953–957.

[62] M. Budagavi and J. D. Gibson, “Error propagation in motion compensated video over wireless channels,” in Proc. IEEE Int. Conf. Image Processing (ICIP), vol. 2, 1997, pp. 89–92.

[63] M. Budagavi and J. D. Gibson, “Random lag selection in multi-frame motion compensation,” presented at the IEEE Int. Symp. Information Theory, Boston, MA, 1998.

[64] T. Wiegand, N. Färber, K. Stuhlmüller, and B. Girod, “Error-resilient video transmission using long-term memory motion-compensated prediction,” IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 1050–1062, Jun. 2000.

[65] P. Salama, N. B. Shroff, and E. J. Delp, “Error concealment in encoded video,” in Image Recovery Techniques for Image Compression Applications. Norwell, MA: Kluwer, 1998.

[66] W. M. Lam, A. R. Reibman, and B. Liu, “Recovery of lost or erroneously received motion vectors,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 1993, pp. 417–420.

[67] V. Varsa, M. M. Hannuksela, and Y.-K. Wang, “Non-normative error concealment algorithms,” Int. Telecommun. Union-Telecommun. (ITU-T), VCEG-N62, Sep. 2001.

[68] Y.-K. Wang, M. M. Hannuksela, V. Varsa, A. Hourunranta, and M. Gabbouj, “The error concealment feature in the H.26L test model,” in Proc. IEEE Int. Conf. Image Processing (ICIP), vol. 2, 2002, pp. 729–732.

[69] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[70] S. Wenger, “H.264/AVC over IP,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 645–656, Jul. 2003.

[71] T. Stockhammer, M. M. Hannuksela, and T. Wiegand, “H.264/AVC in wireless environments,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 657–673, Jul. 2003.

[72] M. Karczewicz and R. Kurçeren, “The SP and SI frames design for H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 637–644, Jul. 2003.

[73] T. Wedi and H. G. Musmann, “Motion- and aliasing-compensated prediction for hybrid video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 577–586, Jul. 2003.

[74] M. Flierl and B. Girod, “Generalized B pictures and the draft JVT/H.264 video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 587–597, Jul. 2003.

[75] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598–603, Jul. 2003.

[76] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, Jul. 2003.

[77] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 614–619, Jul. 2003.

[78] J. Ribas-Corbera, P. A. Chou, and S. L. Regunathan, “A generalized hypothetical reference decoder for H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 674–687, Jul. 2003.

Gary J. Sullivan received the B.S. and M.Eng. degrees from the University of Louisville, Louisville, KY, in 1982 and 1983, respectively, and the Ph.D. and Engineer degrees in electrical engineering from the University of California, Los Angeles, in 1991.

He was the Manager of Communications Core Research at PictureTel Corporation, the former world leader in videoconferencing communication. He was a Howard Hughes Fellow and Member of the Technical Staff in the Advanced Systems Division of Hughes Aircraft Corporation and was a Terrain-Following Radar System Software Engineer for Texas Instruments. He is currently a Video Architect in the Core Media Processing Team of the Windows Digital Media Division of Microsoft Corporation, Redmond, WA. At Microsoft he designed and remains active in the extension of the DirectX Video Acceleration (DXVA) API/DDI feature of the Microsoft Windows operating system platform. He is the Chairman of the Joint Video Team (JVT) for the development of the latest international video coding standard known as H.264/AVC, which has recently been completed as a joint project between the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (ISO/IEC JTC 1/SC 29/WG 11, MPEG). He is also the Rapporteur of Advanced Video Coding in the ITU-T, where he has led VCEG (ITU-T Q.6/SG16) for about eight years. He is also the ITU-T video liaison representative to MPEG and he formerly served as MPEG’s video chairman from March of 2001 to May of 2002. His research interests and areas of publication include image and video compression, rate-distortion optimization, motion representation, scalar and vector quantization, and error and packet loss resilient video coding.

Thomas Wiegand received the Dr.-Ing. degree from the University of Erlangen-Nuremberg, Germany, in 2000 and the Dipl.-Ing. degree in electrical engineering from the Technical University of Hamburg-Harburg, Germany, in 1995.

He is the head of the Image Communication Group in the Image Processing Department of the Heinrich Hertz Institute, Berlin, Germany. From 1993 to 1994, he was a Visiting Researcher at Kobe University, Kobe, Japan. From 1997 to 1998, he was a Visiting Researcher at Stanford University, Stanford, CA, and served as a consultant to 8×8, Inc., Santa Clara, CA. He started his research on video compression and transmission in 1995, as a Visiting Scholar at the University of California, Santa Barbara. Since then, he has published several conference and journal papers on the subject and has contributed successfully to the ITU-T Video Coding Experts Group (ITU-T SG16 Q.6, VCEG), ISO/IEC Moving Picture Experts Group (ISO/IEC JTC 1/SC 29/WG 11, MPEG), and Joint Video Team (JVT) standardization efforts and holds various international patents in this field. He has been appointed as the Associate Rapporteur of the ITU-T VCEG (October 2000), the Associate Rapporteur/Cochair of the JVT that was created by ITU-T VCEG and ISO/IEC MPEG for finalization of the H.264/AVC video coding standard (December 2001), and as the Editor of the H.264/AVC video coding standard (February 2002).
