
ACM Multimedia 97 - Electronic Proceedings

November 8-14, 1997

Crowne Plaza Hotel, Seattle, USA

Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder

Peter Soderquist
School of Electrical Engineering
Dept. of Electrical and Computer Engineering
Cornell University
Ithaca, New York 14853
(607) 255-0314
[email protected]
http://www.ee.cornell.edu/~pgs/

Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
(617) 373-3814
[email protected]
http://www.ece.neu.edu/

ACM Copyright Notice

Abstract

Multimedia functionality has become an established component of core computer workloads. MPEG-2 video decoding represents a particularly important and computationally demanding example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of this and other multimedia algorithms. However, partly as a consequence of this improved CPU performance, memory subsystem deficiencies have now become the major barrier to increased performance. Decoding MPEG-2 video in software places significant bandwidth demands on memory subsystems, a problem seriously aggravated by cache inefficiencies. Conventional data caches generate many times more cache-memory traffic than required, at best double the minimum necessary to support decoding. Improving efficiency requires understanding the behavior of the decoder and the composition of its data set. We provide an analysis of the memory and cache behavior of software MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive handling of different data types can reduce traffic by 50 percent or more.

Keywords


MPEG-2, digital video, caches, computer architecture

Table of Contents

1 Introduction
2 Background
3 MPEG Overview
4 Related Work
4.1 Media Processors
4.2 SIMD Instruction Extensions
4.3 Video/Graphics Data Buses
4.4 Prefetching
4.5 Our Approach: Cache-Oriented Enhancements
5 Experimental Setup
5.1 Machine and Software Environment
5.2 Steady-State Simulation
5.3 Color Space and Bandwidth
6 Cache Simulation Results
6.1 Cache Size
6.2 Associativity
6.3 Line Size
6.4 Summary
7 Decoder Internal Analysis
7.1 Data Set Composition
7.2 Decoder Program Flow
7.3 System and Audio Interaction
8 Cache-Oriented Enhancements
8.1 Selective Caching
8.2 Cache Locking
8.3 Scratch Memory
8.4 Data Reordering
8.5 Cache Partitioning
8.6 Summary
9 Conclusions
10 Acknowledgments
References

1 Introduction

Multimedia computing has become a practical reality, and computer architectures are changing in response. Extensions for continuous media processing are a current or planned part of every market-leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or 64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths to process multiple small data types in a single instruction promises significant performance improvements.
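The SIMD idea behind these extensions can be illustrated in plain Python: eight 8-bit pixels are packed into a single 64-bit word, and one operation updates all eight lanes at once. This is an illustrative sketch of the concept, not the semantics of any particular extension (MMX and its peers also offer saturating arithmetic, which clamps instead of wrapping):

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add eight unsigned 8-bit lanes packed into 64-bit integers,
    keeping each lane independent (no carry crosses a lane boundary);
    each lane wraps modulo 256, like a non-saturating SIMD add."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift
    return result

pack = lambda vals: sum(v << (8 * i) for i, v in enumerate(vals))
unpack = lambda word: [(word >> (8 * i)) & 0xFF for i in range(8)]

pixels = [10, 20, 252, 0, 100, 200, 30, 40]   # eight 8-bit pixel values
offset = [5] * 8                              # uniform brightness adjustment
out = unpack(packed_add_u8(pack(pixels), pack(offset)))
print(out)   # → [15, 25, 1, 5, 105, 205, 35, 45]  (252 + 5 wraps to 1)
```

A hardware SIMD unit performs the eight lane additions in one cycle; the loop here only models the lane-isolation behavior.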

Architecture   ISA Extension
x86            MMX (MultiMedia eXtensions)
SPARC          VIS (Visual Instruction Set)
PA-RISC        MAX-2 (Multimedia Acceleration eXtensions)
Alpha          MVI (Motion Video Instruction)
MIPS           MDMX (MIPS Digital Media eXtensions)
PowerPC        VMX (Video and Multimedia eXtensions)*

Table 1. ISA multimedia extensions for popular architectures (* = proposed)

All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they mount a formidable assault on its computational complexity. Unfortunately, they largely fail to address the significant challenge that MPEG processing poses for memory subsystems, which have become the primary performance bottleneck. The high data rate, large set sizes, and distinctive memory access patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size, associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rates are acceptable, standard caches, even very large and highly associative ones, generate significant excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by enabling the CPU to consume and produce more data in fewer cycles.

This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video process can potentially cripple communications, print spooling, system management, network agent, and other background tasks. The video playback itself may also suffer from dropped frames, jitter, blocking, or other annoying artifacts. For low-cost and low-power systems and information appliances, cache inefficiency can have a direct cost impact, requiring faster or higher-capacity components than strictly necessary to achieve the specified functionality at the required quality level. This can drive up system cost, increase power consumption, or even prevent implementation.

To explore the cache efficiency problem and its solutions, we examine the traffic and data cache behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor. Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We show that, contrary to some predictions, even relatively small, simple caches provide a benefit. However, even very large and complex caches generate significant excess traffic. An analysis of the data types used by the decoder and their access patterns provides the basis for proposed architectural enhancements. We also show that even relatively simple measures, such as selective caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by 50% or more for a wide range of cache sizes.

<-- Table of Contents

2 Background

Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system (e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in main memory. The CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one frame at a time, to main memory, where it is stored in an image of the displayed video window. Finally, another DMA transfer brings the video data to the frame buffer, where it eventually appears on the display.

Figure 1. Typical desktop machine with software MPEG-2 decoding traffic

It is clear that a large amount of data, much of it time-sensitive, is being transferred through the system. Most of this traffic is concentrated on the main system bus. At the very least, there are two streams each of encoded and decoded video being concurrently transferred in and out of main memory, amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other applications running on the system and sharing the interconnect, main memory, and peripherals. Any excess memory traffic generated by cache inefficiency further exacerbates this situation. Unfortunately, our simulations indicate that standard caches generate at least twice the minimum required level of cache-memory traffic, and typically many times more, amounting to a significant strain on system capacity. The result is some combination of dropped frames, reduced frame rate, and degradation of overall system performance.
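The 63 Mbytes/s figure is consistent with back-of-envelope arithmetic, assuming the Main Level parameters used later in the paper (704 by 480 pixels at 30 frames/s), 3 bytes per pixel of dithered RGB output, a 6 Mbit/s compressed stream, and each stream crossing the bus twice (once written into main memory, once read back out):

```python
# Rough reconstruction of the ~63 Mbytes/s minimum sustained bus load.
# Assumed parameters: 704x480 at 30 frames/s, 3 bytes/pixel RGB output,
# 6 Mbit/s compressed input; each stream moves both into and out of memory.
width, height, fps = 704, 480, 30
decoded = width * height * 3 * fps   # decoded RGB video, bytes/s
encoded = 6_000_000 // 8             # compressed bitstream, bytes/s
total = 2 * (decoded + encoded)      # two streams each, in and out
print(round(total / 1e6, 1))         # → 62.3 (Mbytes/s, roughly the 63 quoted)
```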

<-- Table of Contents

3 MPEG Overview

Appreciating the challenges of supporting video decoding requires some understanding of the MPEG standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve compression. Video data is broken down into 8 by 8 pixel Blocks and passed through a discrete cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded, and then further compressed with an entropy coding algorithm. To exploit temporal redundancy, MPEG encoding uses motion compensation with three different types of frames. I (intra) frames contain a complete image, compressed for spatial redundancy only. P (predicted) frames are built from 16 by 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous I or P frame (the reference frame), translated as a group from their location in the source. This information is stored as a vector representing the translation, plus a DCT-encoded difference term, requiring far fewer bits than the original image fragment. B (bidirectional) frames can use the closest two I or P pictures - one before and one after in temporal order - as reference frames. Information not present in the reference frames is encoded spatially on a block-by-block basis. All of the data in P and B frames is also subject to run-length and entropy coding.
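The prediction step for a P-frame macroblock can be sketched as follows. The function name and the integer-pel handling are illustrative simplifications; real MPEG-2 also allows half-pel motion vectors and field-based prediction:

```python
def reconstruct_macroblock(reference, mb_x, mb_y, mv, residual):
    """Rebuild one 16x16 P-frame macroblock: fetch the region of the
    reference frame displaced by the motion vector (dx, dy), then add
    the DCT-decoded difference term, clamping to the 8-bit pixel range."""
    dx, dy = mv
    return [[max(0, min(255, reference[mb_y + y + dy][mb_x + x + dx]
                              + residual[y][x]))
             for x in range(16)]
            for y in range(16)]

# A flat mid-gray reference frame plus a uniform +10 difference term
ref = [[128] * 64 for _ in range(64)]
res = [[10] * 16 for _ in range(16)]
mb = reconstruct_macroblock(ref, 16, 16, (3, -2), res)
print(mb[0][0])   # → 138
```

Note how each macroblock pulls a 16 by 16 region out of a large reference frame: this is the access pattern that dominates the decoder's memory behavior in the sections that follow.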


Figure 2. Typical sequence of MPEG frames, showing interframe dependencies

Figure 2 shows the interframe dependencies of the different frame types, superimposed on the displayed frame order. As a result of these dependencies, the frames must be decoded in the non-temporal order [1,4,2,3,7,5,6,10,8,9]. Interframe dependencies and the properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and the nature of the hardware support required.
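The decode order follows mechanically from the dependency rule: a B frame cannot be decoded until both of its reference frames have been, so each I or P frame must be decoded ahead of the B frames that precede it in display order. A sketch, with the frame types assumed from the 9-frame GOP of Figure 2 plus the following GOP's I frame:

```python
def decode_order(frame_types):
    """Derive decode order from display order: each I or P (anchor) frame
    is emitted before the B frames that precede it in display order,
    because those B frames reference it."""
    order, pending_b = [], []
    for i, t in enumerate(frame_types, start=1):
        if t == 'B':
            pending_b.append(i)
        else:                      # I or P: anchor frame
            order.append(i)
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)
    return order

print(decode_order(['I','B','B','P','B','B','P','B','B','I']))
# → [1, 4, 2, 3, 7, 5, 6, 10, 8, 9]
```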

Figure 3. MPEG bitstream structure

The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as start codes, mark the divisions between different sections of the data. The bitstream has a hierarchical structure, shown in simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOP's). A GOP contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of collections of macroblocks called slices. The ``headers'' shown at each level contain parameters relevant to each bitstream type but no actual image data.
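Locating the sections of the bitstream amounts to scanning for the byte-aligned start code prefix 00 00 01 followed by a one-byte code value. A minimal sketch (the code values shown are the standard ones: 0xB3 marks a sequence header, 0xB8 a GOP header, 0x00 a picture):

```python
def find_start_codes(data: bytes):
    """Return (offset, code) pairs for each MPEG start code in the
    buffer: the byte-aligned prefix 00 00 01 followed by a one-byte
    code value identifying the bitstream section that begins there."""
    hits = []
    i = 0
    while i + 3 < len(data):
        if data[i] == 0 and data[i + 1] == 0 and data[i + 2] == 1:
            hits.append((i, data[i + 3]))
            i += 4
        else:
            i += 1
    return hits

stream = bytes([0, 0, 1, 0xB3,      # sequence header
                0xFF, 0xFF,         # (payload bytes)
                0, 0, 1, 0xB8,      # GOP header
                0, 0, 1, 0x00])     # picture start
print(find_start_codes(stream))
# → [(0, 179), (6, 184), (10, 0)]
```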

<-- Table of Contents

4 Related Work

Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This section discusses several significant methods for performance enhancement under investigation by researchers in both industry and academia. Some of these are merely proposals, while others have already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages of each approach, and assess its impact on the problem of excess cache-memory bandwidth.

4.1 Media Processors

A new breed of microprocessor, the so-called ``media processor,'' has emerged in just the past few years. These chips combine programmability with specialized hardware to support multiple concurrent multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and Samsung, claim that media processors, not the host CPU, are the best present and long-term solution for providing multimedia functionality in computer systems.

To their advantage, media processors reduce memory bandwidth requirements, since they deal in compressed data, and take a significant computational load off the host. They are also good at handling the real-time constraints of multimedia, while present-day operating systems and memory hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations tied to international standards - an efficient target for highly specialized solutions.

However, performing multimedia functions on the host processor is inherently less expensive than using extra specialized hardware. Memory systems are also improving to meet the needs of multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are likewise likely to become more real-time oriented as they evolve. More fundamentally, multimedia data is a new first-order data type, on the same order as floating-point. These functions are becoming central to what computers do, not ``peripheral'' like mass storage or printing. As applications evolve, media and other operations will become more and more interleaved - someday viewing a video clip will be as seamless as reading e-mail or editing a spreadsheet. Host multimedia processing best supports the construction of highly integrated, fully featured, yet portable applications. Hardware integration trends and the powerful vested interest of CPU manufacturers also favor this outcome.

Nevertheless, parts of video processing, such as color space conversion, scaling, and dithering, are likely to remain in specialized hardware for quite some time. These operations are straightforward to implement in hardware and are largely not data-dependent, and therefore not subject to pipeline hazards and other impediments. They are more akin to truly ``peripheral'' operations.

<-- Related Work

4.2 SIMD Instruction Extensions

Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1 achieve speedup primarily by performing SIMD-style operations on multimedia data types. These ISA enhancements are an essential step toward improving microprocessor performance on multimedia applications - as well as ensuring the continued viability of host processing. However, these operations tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per unit time, multimedia instructions expose the underlying weaknesses of the cache, system bus, and main memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium for this very reason. Intel researchers even blame the below-expected speedup of MPEG with MMX on memory and I/O system deficiencies [9].

<-- Related Work

4.3 Video/Graphics Data Buses

There have been proposals for implementing special interconnect to transfer image data from the host processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Accelerated Graphics Port (AGP) is the most prominent such effort, intended to provide a direct link between main memory and the graphics subsystem. While primarily meant to support the high bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This method provides another significant way to boost the capabilities of host multimedia processing relative to custom hardware solutions. However, it does nothing in and of itself to alleviate wasted cache-memory traffic or reduce overall memory-system bandwidth needs. The undiminished requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus architecture for future PC memory systems.

<-- Related Work

4.4 Prefetching

Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it is primarily promoted as a means to reduce miss rates, without much consideration of cache-memory traffic, which tends to increase with prefetching, especially under the more aggressive schemes. Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia algorithms. However, the emphasis of our research is not so much on getting values into the cache early as on keeping them around for as long as they are needed.

<-- Related Work

4.5 Our Approach: Cache-Oriented Enhancements

The methods discussed above largely bypass the issue of excess cache-memory traffic, and some of them even exacerbate the problem. None of them directly target cache inefficiency. Even prefetching focuses on reducing miss rates at the possible expense of memory traffic. We suggest a different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of the individual data types - their sizes, access types, and access patterns - can lead to effective architectural solutions to cache inefficiency.

<-- Related Work

<-- Table of Contents

5 Experimental Setup

Solving the problem of excess cache-memory traffic requires first establishing its extent, and how critical system parameters affect this behavior. Our results come from trace-driven cache simulations of video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a resolution of 704 by 480 pixels at 30 frames/s and a compressed bitrate of 6 Mbits/s. These are comparable in perceptual quality to conventional analog broadcast formats (e.g. NTSC, PAL). The sequences are chosen to be representative of different types of programming.

5.1 Machine and Software Environment


For experimental purposes we use a simplified machine model, limiting the microprocessor to a single level of caching. We further assume that MPEG-2 decoding is the only active process. The decoder is a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC processor under Solaris. This decoder has the advantage of portability and relative system independence, with the disadvantage that performance is suboptimal for any given architecture. Since it originates with the body responsible for MPEG, it is written to comply closely with the standard. Finally, it has the feature, essential for research purposes, of being available in source code form.

We have chosen to treat the decoder implementation as a given, making only minimal modifications to support integration with our simulation tools. Trace generation is performed by QPT2 (the Quick Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We have implemented our cache simulator, Pacino, to support the unique qualities of continuous media applications, which differ considerably from conventional benchmarks of the SPEC variety in their consumption and production of data and in their interaction with the operating system.

<-- Experimental Setup

5.2 Steady-State Simulation

To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at steady state - i.e. as if the GOP were preceded and followed by a large number of other GOP's, plucked from the middle of a sequence. This eliminates the side effects of program startup and termination. To achieve this, the results of running initialization code are not logged, the cache is primed with data to avoid cold-start effects, and dirty lines are not flushed at the end of the GOP.

<-- Experimental Setup

5.3 Color Space and Bandwidth

The simulation framework implements one significant performance optimization beyond the generic hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance (grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In 4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that upsampling, color space conversion, and any dithering are done on the fly by the video display subsystem; these features are becoming common in workstations and budget PC video accelerators alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s. Finally, these conversion operations account for so many memory and instruction references that they make trace-driven simulation prohibitive - a sufficiently compelling reason to avoid them.
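Both the halving claim and the reduced bus load can be checked by arithmetic: a 4:2:0 image stores one full-resolution luminance plane plus two chrominance planes subsampled by two in each dimension, i.e. 1.5 bytes per pixel instead of 3. Under the same assumptions as before (704 by 480 at 30 frames/s, 6 Mbit/s input, each stream crossing the bus twice):

```python
# Size of one frame in 4:4:4 vs. 4:2:0, and the resulting minimum bus load.
width, height, fps = 704, 480, 30
bytes_444 = width * height * 3                                  # full-res R, G, B
bytes_420 = width * height + 2 * (width // 2) * (height // 2)   # Y + subsampled Cr, Cb
assert bytes_420 * 2 == bytes_444       # exactly half the size

decoded = bytes_420 * fps               # decoded video, bytes/s
encoded = 6_000_000 // 8                # compressed bitstream, bytes/s
total = 2 * (decoded + encoded)
print(round(total / 1e6, 1))            # → 31.9 (close to the 31.5 Mbytes/s quoted)
```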

<-- Experimental Setup

<-- Table of Contents

6 Cache Simulation Results

Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into the traffic seen by the memory. First of all, there is a distinct benefit to caching video decoder data, even using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient for video data; accordingly, several media processors dispense with data caches altogether in favor of SRAM banks - essentially large register files - managed by software [5,3,2]. Yet we have found that despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient re-use of values for caching to significantly reduce the required memory bandwidth.

For example, decoding one Group Of Pictures from a typical MPEG-2 stream generates 2.1 x 10^8 memory requests. A 16 Kbyte, direct-mapped cache generates 3.9 x 10^7 words of memory traffic. If we estimate conservatively that each word of cache traffic represents a memory request, then even a relatively small, simple cache keeps over 80% of traffic off of the memory bus. In fact, caches typically load several words at a time, while many CPU memory requests are for data types smaller than a word. Therefore, data caches can make far more efficient use of available memory bandwidth than registers alone. Even very clever register management would have trouble competing with anything but the smallest caches, so even information appliance-type platforms might benefit from implementing an actual cache.

The remainder of this section examines in more detail how the basic parameters of a simple cache (cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in our simulations implement a write-back policy on replacement. Not only is this the most popular write policy in actual implementations, but it typically generates far less memory traffic than write-through, the best alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a common feature of all simulated caches.
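The traffic accounting such a simulation performs can be sketched in a few lines. This is a minimal illustrative model, not the Pacino simulator itself: direct-mapped, write-back, write-allocate, with every miss costing one line fill plus a writeback whenever the evicted line is dirty:

```python
class WriteBackCache:
    """Direct-mapped, write-back, write-allocate cache that counts
    cache-memory traffic in bytes (line fills plus dirty writebacks) -
    a minimal sketch of the bookkeeping in a trace-driven simulator."""
    def __init__(self, size=16384, line=32):
        self.line, self.nsets = line, size // line
        self.tags = [None] * self.nsets
        self.dirty = [False] * self.nsets
        self.traffic = self.misses = self.accesses = 0

    def access(self, addr, is_write):
        self.accesses += 1
        idx = (addr // self.line) % self.nsets
        tag = addr // (self.line * self.nsets)
        if self.tags[idx] != tag:          # miss
            self.misses += 1
            if self.dirty[idx]:            # write the victim line back first
                self.traffic += self.line
            self.traffic += self.line      # line fill (write-allocate on writes too)
            self.tags[idx], self.dirty[idx] = tag, False
        if is_write:
            self.dirty[idx] = True

cache = WriteBackCache()
for a in range(0, 4096, 4):        # sequential word reads: one miss per 32-byte line
    cache.access(a, False)
print(cache.misses, cache.traffic)   # → 128 4096
```

Feeding a full address trace through `access` yields exactly the miss and traffic counts discussed in the rest of this section, modulo associativity.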

If we assume a ``perfect'' cache, the traffic required to support decoding equals the size of the encoded stream, which is read in from memory, plus that of the decoded video data, which is written back. This absolute minimum cache-memory traffic is used as a yardstick for comparing the performance of different configurations in our simulations.
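For the parameters assumed earlier (a 9-frame GOP at 704 by 480, 30 frames/s, 6 Mbit/s, 4:2:0 output), this yardstick works out to roughly:

```python
# Minimum cache-memory traffic for one GOP under a ``perfect'' cache:
# the encoded bits read in plus the decoded 4:2:0 frames written back.
# (Illustrative parameters: 9-frame GOP, 704x480, 30 frames/s, 6 Mbit/s.)
frames, width, height = 9, 704, 480
decoded_out = frames * (width * height * 3) // 2    # 1.5 bytes/pixel in 4:2:0
encoded_in = (6_000_000 // 8) * frames // 30        # the GOP's share of the bitstream
minimum = encoded_in + decoded_out
print(minimum)   # → 4786920, i.e. about 4.8 Mbytes per GOP
```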

6.1 Cache Size

The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because cache size usually increases or decreases by factors of two, the decision of how large a cache to implement in a system is pivotal. For Main Level MPEG-2 decoding, cache-memory traffic as a function of cache size and associativity shows little variation between video sequences. Figure 4 shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a surface showing the minimum possible cache-memory traffic for the sequence, consisting of the encoded and decoded video streams combined.


Figure 4. Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible traffic level

The most prominent feature of the traffic function is a large plateau at a level of 6.3 times the minimum value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at 1 Mbyte and plummets at 2 Mbytes, reflecting the 2 Mbyte size of the decoder data set. Larger caches show negligible improvement, with the additional space providing no extra benefit. However, the smallest measured value is still almost 2 times the minimum possible value.

These data imply that a 32 Kbyte cache is just as good - or bad - for MPEG-2 Main Level decoding as, for example, a 64 Kbyte or 128 Kbyte one. Improving on cache-memory traffic requires a cache of 1 Mbyte or more, and even a 2 Mbyte or larger cache only brings traffic down to double the absolute minimum, leaving plenty of room for improvement.

<-- Cache Simulation Results

6.2 Associativity

Increasing set associativity is a popular method for getting higher performance out of smaller caches. The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the direct-mapped case. Set sizes greater than 4, however, show minimal benefit across all cache sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity, and access time, such enhancements are not justified.

While memory traffic is our primary focus, we would prefer not to adversely affect miss rates. As a function of cache size and set associativity, miss rates show the same behavioral pattern as memory traffic. Unlike for cache-memory traffic, however, the performance is more than satisfactory, with an average value below 0.5%.

<-- Cache Simulation Results

6.3 Line Size

Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock placement can help decouple the size of cache lines from that of the memory bus. Unfortunately, there is contention between miss rate and memory traffic minimization. Low miss rates call for lines larger than the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but require more data to be read and possibly written back on a miss. For this reason, minimal memory traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute minimum.
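The tension between miss rate and traffic can be made concrete with a simple model: every miss moves one line in, plus one line out for the fraction of victims that are dirty, so traffic scales with misses times line size. Doubling the line size only pays off if it cuts misses by more than half. The numbers here are hypothetical, for illustration only:

```python
def miss_traffic(misses, line_size, dirty_fraction):
    """Cache-memory traffic caused by misses: one line fill each,
    plus a writeback for the fraction of victim lines that are dirty."""
    return misses * line_size * (1 + dirty_fraction)

# Hypothetical numbers: doubling the line from 32 to 64 bytes cuts
# misses by only 30%, so traffic still rises by 40%.
t32 = miss_traffic(1_000_000, 32, 0.5)
t64 = miss_traffic(700_000, 64, 0.5)
print(t64 / t32)   # → 1.4
```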

Figure 5. Typical miss rates for MPEG-2 Main Level decoding over different cache sizes

Figure 6. Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes

Balancing the demands of miss rates and memory traffic requires further investigation. For the time being, maintaining the ubiquitous 32 byte line size seems sensible; the interests of memory traffic reduction argue against a switch to larger values.

<-- Cache Simulation Results

6.4 Summary

It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic. The simplest way to improve memory traffic and miss rates is to have a very large cache - but a large cache is not always feasible, and our simulations show that improving MPEG-2 decoding performance requires very big caches. In any case, one would prefer to extract better performance from smaller, lower-cost resources, improving performance through increased efficiency rather than brute force.

<-- Cache Simulation Results

<-- Table of Contents

7 Decoder Internal Analysis

This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first step toward explaining and rectifying its suboptimal use of cache resources. First, we dissect the different kinds of data used by the decoder, including compressed input data, image output data, and all of the intermediate types. Second, we examine how these data objects are used in the process of decoding, and how this affects the performance of the cache.

7.1 Data Set Composition

The data set of an MPEG-2 decoder - the information that must be available to the program in the course of its execution - consists of several data types which serve distinct purposes. As a result, they are quite heterogeneous with respect to the types and patterns of access employed by the decoder, as well as the amount of memory space required. The following classes account for the global and static local values in user space.

Input. The compressed MPEG-2 sequence; data are read serially from a fixed-size buffer and refreshed by system calls.

Output. Uncompressed picture data in YCrCb format, stored in a video window image buffer. This data type is write only - written but never read by the CPU. System calls transfer each completed picture into the frame buffer.

Tabular. Static, read-only information used in the MPEG decoding process, such as various lookup tables.

Reference. The current frame and the previously decoded frames used to reconstruct it (up to two for a B frame), in YCrCb form.

Block. The DCT coefficient and pixel values for a single macroblock.

State. Values incidental to the settings and operation of the decoder, yet not part of the image data per se.

This partitioning of data types may not be explicit within the decoder source, but represents a higher-level abstraction of the data needed for decoder operation. Most of these data types are both read and written. The exceptions are the tabular data, which are read only, and the output data, which are write only. Within a memory hierarchy, reading and writing are not symmetric operations; read misses and write misses in caches have very different latencies. The fact that some MPEG decoder data types are only read or only written can be exploited for performance optimization.

The essential properties of the different data types are summarized in Table 2. Note how, in terms of size, the reference and output types completely dominate the others. With respect to caching, this means that their presence will tend to repeatedly expel the other types from the cache, except in very large data caches. It does not matter how many times the other values are re-used: capacity limitations alone will ensure that reference and output data, as they are updated, throw the other data types out.

Data Type   Access Type   Size      Fraction of References
Input       read/write    2 KB       2.7%
Output      write only    500 KB     3.9%
Tabular     read only     5 KB       5.5%
Reference   read/write    1500 KB   23.7%
Block       read/write    1.5 KB    31.4%
State       read/write    0.5 KB    25.6%

Table 2. Summary of decoder data types, sizes, access types, and proportion of memory references accounted for

Ranking the data types in terms of the number of references rather than by size gives a very different picture. Block and state data account for by far the most memory requests. Reference data is next, but output data is in next-to-last place, accounting for 8 times fewer memory requests than block data. Note that the percentages do not add up to 100%; approximately 7.2% of data references are due to temporary variables and library functions.
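The capacity effect described above - large streaming types expelling small, heavily re-used ones - can be illustrated with a toy direct-mapped cache model. All sizes, addresses, and the trace itself are invented for illustration; they loosely echo Table 2 but are not our simulation parameters.

```python
LINE = 32          # bytes per cache line (illustrative)
LINES = 512        # 512 lines -> a 16 KB direct-mapped cache

def simulate(trace):
    """Return (hits, misses) for a list of byte addresses."""
    tags = [None] * LINES
    hits = misses = 0
    for addr in trace:
        index = (addr // LINE) % LINES
        tag = addr // (LINE * LINES)
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag
    return hits, misses

# A small 4 KB lookup table, read word by word before and after a sweep
# through a 1500 KB "reference frame" (sizes loosely echo Table 2).
table = list(range(0, 4096, 4))
frame = list(range(1 << 20, (1 << 20) + 1500 * 1024, 4))

_, m_with_sweep    = simulate(table + frame + table)
_, m_without_sweep = simulate(table + frame)

# The sweep evicted every table line: the second table pass misses on
# all 128 of its lines, despite the table's high re-use.
assert m_with_sweep - m_without_sweep == 128
```

However many times the table was touched earlier, the frame sweep wraps the index space many times over and leaves none of it resident - the behavior the reference and output types impose on the smaller types.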


7.2 Decoder Program Flow

The hierarchical composition of the MPEG bitstream, and its intrinsic sequentiality, constrain the structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested loops corresponding to different data levels in the bitstream, as shown in Figure 3. The ``initialization'' and ``termination'' blocks represent the operations excluded by steady-state simulation.


Figure 7. Model of MPEG-2 software decoder showing phases of operation for a GOP
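The nested-loop structure can be sketched as follows. The function and phase names are placeholders for the operations in Figure 7, not identifiers from any actual decoder source, and the loop bounds are arbitrary.

```python
phases = []   # records the order of operations, standing in for real work

def decode_gop(n_pictures=2, mbs_per_picture=3):
    """Toy walk through one GOP, mirroring the phases in Figure 7."""
    phases.append("read GOP header")
    for _ in range(n_pictures):                # picture loop
        phases.append("read picture header")
        for _ in range(mbs_per_picture):       # macroblock loop
            phases.append("read block")        # parse coefficients
            phases.append("reconstruct MB")    # motion compensation
            phases.append("IDCT")
            phases.append("merge MB data")
        phases.append("write frame")           # copy picture to output

decode_gop()
assert phases.count("write frame") == 2        # one per picture
assert phases.count("IDCT") == 6               # one per macroblock
```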

Table 3 shows which data types are accessed in the different phases of operation. Parsing header information and reading in block data involves operating on relatively small data types, and results in a small active cache set. However, during reconstruction, the decoder accesses portions of the reference frames and copies them into the frame currently under construction. The Inverse DCT (IDCT) phase focuses on smaller data types, but merging the decoded pixels back into the new picture requires once again accessing large data types. Writing the completed frame requires traversing significant portions of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.

Phase of Operation   input   output   tabular   reference   block   state
read headers           X                 X                            X
read block             X                 X                     X      X
reconstruct MB                                      X          X
IDCT                                     X                     X      X
merge MB data                                       X          X      X
write frame                     X                   X                 X

Table 3. Data types accessed in each phase of decoder operation


Each iteration through the macroblock loop accesses a new portion of one or two reference frames. Each new P or B frame repeats this traversal through its reference frame(s), recalling significant fractions of each frame back into the cache. During the ``write frame'' phase, the decoder spends most of its time copying the current frame from the reference space to the output space. For all but the largest caches, this has the effect of flushing out almost everything except video data for the next displayed picture. The other data types will have to be reloaded for each new picture, or possibly several times per picture for small caches, resulting in wasted cache-memory traffic.


7.3 System and Audio Interaction

Our simulations assume that MPEG video decoding is the only active process. In reality, video is usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect that running these other processes will not dramatically affect our results, since the amount of data they operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache, and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size. Nevertheless, the presence of these other programs is a good argument for keeping the cache footprint of the video data as small as possible.


8 Cache-Oriented Enhancements

Our analysis of decoder behavior and data composition motivates a specific approach to performance optimization, where the goal is to improve cache efficiency for video decoding without adversely impacting other performance metrics or other applications. In particular, there are great potential benefits in treating different data types distinctly, guided by their different sizes, access types, and access patterns.

For example, write-only values, like video output data, are clearly a waste of cache space. Likewise, while block and state data account for most of the memory requests, their relatively small size means that larger data types, like reference data, systematically crowd them out of the cache. Preventing blatant cache pollution and the predation of small data types with high re-use can yield considerable cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques.

Selective Caching. Exclude specific data from the cache.
Cache Locking. Reserve selected cache lines for particular data objects; lines are untouched by cache replacement while locked.
Scratch Memory. Implement a small portion of addressable system memory on the processor - not as cache - for storage of small, frequently-used data.
Data Reordering. Perform cache-conscious modifications of decoder memory accesses.
Cache Partitioning. Allocate different sections of the cache to different data types.

Many of these methods are relatively simple to implement, and most have precedents in other architectures, since they have performance benefits beyond video decoding. Their efficacy for MPEG-2 processing might encourage their broader adoption. In any case, the main challenge is to successfully apply these various techniques in a manner that enhances decoder performance while avoiding excessive cost and complexity. The remainder of this section considers each method on its own.

8.1 Selective Caching

Data objects with no possibility of re-use steal cache space from other data, which leads to excess memory traffic when the excluded data are loaded in once again. Video output data is a perfect example, since it is written but never read by the CPU, and occupies a fairly large amount of space. Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use.

In our simulations, we have found that excluding video output data alone from the cache reduces cache-memory traffic by up to 50%. Across the configurations considered, not caching video output yields a 25% reduction on average. Improvements in cache-memory traffic are global, yet the shape of the curve is not substantially modified: the plateau out to 512 Kbytes persists, as does the swift drop at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of 85% from earlier levels, and 60% on average.
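The mechanism behind these savings can be sketched with a toy write-allocate, write-back cache model. The cache geometry, buffer address, and access trace below are invented for illustration and are not our simulation configuration.

```python
LINE, LINES = 32, 512            # hypothetical 16 KB direct-mapped cache

def mem_traffic(trace, cacheable=lambda addr: True):
    """Bytes moved between cache and memory for (addr, is_write) accesses,
    with a write-allocate, write-back policy; non-cacheable accesses go
    straight to memory as single words."""
    tags = [None] * LINES
    dirty = [False] * LINES
    traffic = 0
    for addr, is_write in trace:
        if not cacheable(addr):
            traffic += 4                     # one word to/from memory
            continue
        idx = (addr // LINE) % LINES
        tag = addr // (LINE * LINES)
        if tags[idx] != tag:
            if tags[idx] is not None and dirty[idx]:
                traffic += LINE              # write back the victim line
            traffic += LINE                  # fetch the new line
            tags[idx], dirty[idx] = tag, False
        dirty[idx] = dirty[idx] or is_write
    return traffic

OUT_BASE = 1 << 20                           # assumed output buffer address
tab    = [(a, False) for a in range(0, 4096, 4)]            # 4 KB table reads
output = [(OUT_BASE + a, True) for a in range(0, 500 * 1024, 4)]
trace  = tab + output + tab

all_cached  = mem_traffic(trace)
skip_output = mem_traffic(trace, cacheable=lambda a: a < OUT_BASE)
assert skip_output < all_cached
```

In this toy trace the bypass variant moves roughly half the bytes - every cached output line costs both a fill and a writeback, while a bypassed store costs only itself - which is consistent in spirit with the reductions reported above, though the exact numbers are artifacts of the model.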

There are several distinct ways of implementing this feature. In the PA-RISC architecture, the load and store instructions can contain ``locality hints'' which signal that the data referred to is unlikely to be re-used. If supported by the particular implementation, the processor can elect to bypass the cache with the data [7]. The UltraSPARC has block load and store instructions, which move groups of data directly between registers and main memory at high speed [10]. This approach allows for potentially more efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of memory as non-cacheable. Once marked, data is transparently and automatically transferred between registers and memory buffers. The cache hint and block load/store approaches require more explicit programming effort.


8.2 Cache Locking

One way to protect small but frequently reused data types, like input, state, and tabular values, from victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While locked, these cache lines are invisible to the replacement algorithm, and their contents will not be thrown out only to be re-loaded when needed again.

This is accomplished by special machine instructions which execute the locking and unlocking. There are two basic variations on this technique. Static locking simply freezes the tag and contents of the affected line, allowing for the writing of values but not replacement; the line is associated with the same portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked lines as an extension of the register set, with special instructions to copy contents directly to and from main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation of legacy PC hardware standards.
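A minimal sketch of static locking, using an invented direct-mapped cache model, shows how locked lines survive a streaming sweep; conflicting accesses simply go around the cache. All sizes and the locking interface are illustrative assumptions, not any real ISA.

```python
LINE, LINES = 32, 256            # hypothetical 8 KB direct-mapped cache

def missed_addrs(trace, lock_below=0):
    """Return the addresses that missed. Cache indices below `lock_below`
    are frozen after their first fill (static locking), so later
    conflicting accesses bypass the cache instead of evicting them."""
    tags = [None] * LINES
    missed = []
    for addr in trace:
        idx = (addr // LINE) % LINES
        tag = addr // (LINE * LINES)
        if tags[idx] == tag:
            continue
        missed.append(addr)
        if tags[idx] is None or idx >= lock_below:
            tags[idx] = tag                  # fill (line unlocked or empty)
        # else: locked line keeps its contents; this access goes around it
    return missed

table = list(range(0, 2048, 4))                              # 2 KB hot data
frame = list(range(1 << 20, (1 << 20) + 256 * 1024, 4))      # streaming sweep
trace = table + frame + table

plain  = missed_addrs(trace)
locked = missed_addrs(trace, lock_below=64)   # lock the table's 64 lines

# Unlocked: the sweep evicts the table, which therefore misses twice over.
# Locked: the second table pass hits entirely.
assert sum(a < 2048 for a in plain) == 128
assert sum(a < 2048 for a in locked) == 64
```

The model also shows the cost: while locked, those 64 lines are unavailable to the streaming data, so locking only pays off for data with high re-use.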


8.3 Scratch Memory

Another means of providing a safe haven for vulnerable data is to put them in a separate memory. Many processors today consist to a large extent not of logic but of SRAM, in the form of caches. It is easy to imagine implementing a portion of the memory space itself on the die - not as a cache, but as addressable memory. Data which are subject to being prematurely ejected from the cache could then


be kept in this space. The components of the MPEG-2 data which would most benefit from this are relatively small, so only a few KBytes would be required. This ``scratch memory'' would be relatively inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due to the lack of tags and logic for lookup, replacement, etc. Similar memories have been a feature of Intel microcontrollers for quite some time, for storing Page 0 of the address space.

A scratch memory provides potentially very fast access to important data, since there is no possibility of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even use special addressing modes to access Page 0 memory which are more concise and faster than those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not compromised, since the scratch memory is off to the side. Of course, getting a hardware feature used in embedded systems, where operating systems and even compilers are optional, to work effectively in an interactive, multitasking environment will take careful consideration, but the potential benefits are considerable.
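The address-decoding idea can be sketched in a few lines: accesses falling in a small fixed window go to an on-chip array with no tags and no miss path, while everything else takes the normal cached route. The base address and window size below are assumptions for illustration.

```python
SCRATCH_BASE, SCRATCH_SIZE = 0xF000, 4 * 1024   # assumed 4 KB window
scratch = bytearray(SCRATCH_SIZE)                # on-chip array: no tags

def store_byte(addr, value, memory):
    """Route a store: the scratch range hits the on-chip array directly."""
    if SCRATCH_BASE <= addr < SCRATCH_BASE + SCRATCH_SIZE:
        scratch[addr - SCRATCH_BASE] = value     # guaranteed hit, no miss path
    else:
        memory[addr] = value                     # normal (cached) path

def load_byte(addr, memory):
    """Route a load the same way."""
    if SCRATCH_BASE <= addr < SCRATCH_BASE + SCRATCH_SIZE:
        return scratch[addr - SCRATCH_BASE]
    return memory.get(addr, 0)

mem = {}
store_byte(SCRATCH_BASE + 16, 0x5A, mem)         # lands in the scratch array
store_byte(0x100, 0x07, mem)                     # lands in ordinary memory
assert load_byte(SCRATCH_BASE + 16, mem) == 0x5A
assert load_byte(0x100, mem) == 0x07
```

Because the routing is a simple range compare on the address, no tag lookup or replacement logic is involved - which is exactly why such a memory is cheaper and faster than a cache of the same size.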


8.4 Data Reordering

This category of enhancement is a departure from our standard procedure of not tampering with the decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many others working on that problem. However, we suggest that modifications of specifically how the decoder software manages and addresses memory can provide significant improvements in cache efficiency - some of which may require hardware support to properly implement.

For example, there is, strictly speaking, no essential need for the output data type. B frame data, considered separately, is never used by the decoder to construct other frames; the only time it is read is for copying to the output buffer space. I and P pictures need to remain available to serve as reference frames. Yet it turns out that in the natural flow of decoder data, by the time these frames are copied to the output space they are no longer needed. If pictures could be transferred directly by DMA from the reference frame space where they are stored, rather than routed through the output data, this would result in greater efficiency. It would eliminate all of the cycle-consuming copying of output data, and transform the B frame data into write-only values which could be efficiently excluded from the cache.
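The proposed reordering can be sketched as follows; the structure, names, and tiny frame sizes are purely illustrative stand-ins, not decoder source. The point is that the picture reaching the display is identical either way, but only the copy path costs CPU work and cache traffic.

```python
cpu_bytes_copied = 0

def display_via_copy(frame, output_buffer, dma_queue):
    """Current path: the CPU copies the picture into the output space,
    and the display DMA then reads the output buffer."""
    global cpu_bytes_copied
    output_buffer[:] = frame
    cpu_bytes_copied += len(frame)
    dma_queue.append(bytes(output_buffer))

def display_via_dma(frame, dma_queue):
    """Proposed path: the DMA engine reads the reference-space frame in
    place; the CPU never touches the pixels again."""
    dma_queue.append(bytes(frame))           # no CPU copy at all

frame = bytearray(b"\x10" * 64)              # tiny stand-in for a picture
out   = bytearray(64)

q1, q2 = [], []
display_via_copy(frame, out, q1)
display_via_dma(frame, q2)

assert q1 == q2                              # same picture reaches the display
assert cpu_bytes_copied == 64                # only the copy path cost CPU work
```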


8.5 Cache Partitioning

It is possible to take the basic idea of cache locking in a slightly different direction. With cache partitioning, particular data types are relegated to specific sections of the cache, which continue to behave like cache. However, rather than locking in the physical address associated with a line, or treating cache lines as extensions of the register set, it is as if each data type has its own separate, smaller cache space for storing its values.

This method provides very precise control over cache behavior, and an extremely powerful tool for the optimal management of cache resources. However, it requires a level of hardware complexity not demanded by the methods discussed previously. Also, the task of efficiently exploiting the available capabilities in software is considerably more challenging.
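A toy model of a partitioned direct-mapped cache shows the isolation effect: each data type indexes only its own slice of the line array, so a streaming type cannot evict the slice reserved for a small, hot type. Geometry and partition sizes are invented for illustration.

```python
LINE = 32

def misses(trace, partition):
    """trace: (addr, dtype) pairs; partition maps a data type to the
    (first_line, num_lines) slice of the cache it may use."""
    tags = {}
    count = 0
    for addr, dtype in trace:
        base, n = partition[dtype]
        idx = base + (addr // LINE) % n        # index only within the slice
        tag = addr // LINE                     # full line number as tag
        if tags.get(idx) != tag:
            count += 1
            tags[idx] = tag
    return count

table = [(a, "tabular") for a in range(0, 2048, 4)]          # 2 KB hot table
frame = [(a, "reference") for a in range(1 << 20, (1 << 20) + 64 * 1024, 4)]
trace = table + frame + table

shared      = {"tabular": (0, 256), "reference": (0, 256)}   # one 8 KB cache
partitioned = {"tabular": (0, 64),  "reference": (64, 192)}  # 2 KB reserved

# The reserved slice keeps the table resident across the frame sweep.
assert misses(trace, partitioned) < misses(trace, shared)
```

Note the trade-off the model exposes: the streaming type now has a smaller effective cache, so partition sizes must be matched to each type's working set - which is exactly the software challenge mentioned above.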



8.6 Summary

None of the methods proposed are mutually exclusive, although some of them, like cache locking and scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or some combination of them provides the most efficient solution remains to be seen. Also, each one raises issues of implementation cost and complexity, interaction with the operating system and other processes, and software interfacing and portability which need to be more thoroughly addressed. We will investigate these issues and perform a more detailed evaluation of these architectural enhancements in future work.


9 Conclusions

The effectiveness of computer memory subsystems, including caches, is currently a significant barrier to improved performance on multimedia applications. We have documented the memory traffic and data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated that standard caches produce a significant excess of cache-memory traffic. While almost any cache is superior to none - even simple caches can reduce required memory bandwidth considerably - experimenting with basic cache parameters like cache size, associativity, and line size has a limited ability to reduce cache-memory traffic. The best value that can be achieved is double the absolute minimum traffic required. However, cache-oriented architectural enhancements, driven by an understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements include selective caching, scratch memory, data reordering, and cache partitioning. All have in common an emphasis on treating different elements of the data distinctly, accommodating their unique properties - size, usage patterns, temporal locality, etc. This approach provides for optimal management of available cache resources. We expect that further refinement of these techniques will enable improved video decoding performance on general-purpose microprocessors, and encourage the broader proliferation of digital video platforms and applications.


10 Acknowledgments

This research was supported in part by NSF grant CCR-9696196. Miriam Leeser is the recipient of an NSF Young Investigator Award. The research is also supported by generous donations from Sun Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful discussions.


References

[1] Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.

[2] Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13, November 1996.

[3] Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, December 1996.

[4] International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2 Committee Draft (MPEG-2), Nov. 1993.

[5] Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr. 1997.

[6] James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.

[7] Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.

[8] Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf. on Multimedia, pages 75-82. ACM, Aug. 1993.

[9] Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.

[10] Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.

[11] Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia. Technical Report CSL-TR-95-683, Stanford University Depts. of Electrical Engineering and Computer Science, Stanford, CA, Dec. 1995.
