
Computer Engineering
Mekelweg 4
2628 CD Delft
The Netherlands

http://ce.et.tudelft.nl/

2005

MSc THESIS

Optimisation of Multimedia Applications for the Philips Wasabi Multiprocessor System

Demid Borodin

Abstract

Faculty of Electrical Engineering, Mathematics and Computer Science

CE-MS-2005-05

Libavcodec is an open source library that contains many different audio/video codecs, including a very fast MPEG4 codec. It has been optimised for several multimedia extensions such as Intel’s MMX, AMD’s 3DNow, etc. Wasabi is a chip multiprocessor targeted at media applications; it is being developed at Philips. Wasabi consists of several TriMedia DSPs and one or more general-purpose CPUs. The TriMedia is a VLIW processor that supports many media operations. This project is focused on (1) porting libavcodec to the TriMedia, (2) improving its performance by applying architecture-specific optimisations, (3) parallelising libavcodec, and (4) providing an interface between the TriMedia(s) executing libavcodec and the general-purpose processor(s) running application(s) that use libavcodec. First we describe how libavcodec was ported to the TriMedia. Libavcodec supports only the most recent gcc compilers, whereas the TriMedia compiler accepts standard ANSI C. Hence certain C constructs had to be replaced by others and certain library functions had to be implemented. Thereafter, we describe how libavcodec was optimised for the TriMedia. The optimisations improve the performance of the MPEG2 and MPEG4 decoders by approximately 41% and 30%, respectively. Parallelisation of the encoder involved changing the interface from Windows threads to the TriMedia multithreading API TM OSAL. A linear speedup has been achieved for up to 6 CPUs; beyond that, the speedup levels off slightly. Additional work is required to make libavcodec more scalable so that it can exploit more processors efficiently. Finally, an interface between the TriMedia(s) executing libavcodec and host applications that use libavcodec is proposed. This interface enables applications to efficiently use libavcodec running on the TriMedias, without having to port the applications themselves to the TriMedia.

Optimisation of Multimedia Applications for the Philips Wasabi Multiprocessor System

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER ENGINEERING

by

Demid Borodin
born in Samarkand, USSR

Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology

Optimisation of Multimedia Applications for the Philips Wasabi Multiprocessor System

by Demid Borodin

Abstract

Libavcodec is an open source library that contains many different audio/video codecs, including a very fast MPEG4 codec. It has been optimised for several multimedia extensions such as Intel’s MMX, AMD’s 3DNow, etc. Wasabi is a chip multiprocessor targeted at media applications; it is being developed at Philips. Wasabi consists of several TriMedia DSPs and one or more general-purpose CPUs. The TriMedia is a VLIW processor that supports many media operations.

This project is focused on (1) porting libavcodec to the TriMedia, (2) improving its performance by applying architecture-specific optimisations, (3) parallelising libavcodec, and (4) providing an interface between the TriMedia(s) executing libavcodec and the general-purpose processor(s) running application(s) that use libavcodec. First we describe how libavcodec was ported to the TriMedia. Libavcodec supports only the most recent gcc compilers, whereas the TriMedia compiler accepts standard ANSI C. Hence certain C constructs had to be replaced by others and certain library functions had to be implemented. Thereafter, we describe how libavcodec was optimised for the TriMedia. The optimisations improve the performance of the MPEG2 and MPEG4 decoders by approximately 41% and 30%, respectively. Parallelisation of the encoder involved changing the interface from Windows threads to the TriMedia multithreading API TM OSAL. A linear speedup has been achieved for up to 6 CPUs; beyond that, the speedup levels off slightly. Additional work is required to make libavcodec more scalable so that it can exploit more processors efficiently. Finally, an interface between the TriMedia(s) executing libavcodec and host applications that use libavcodec is proposed. This interface enables applications to efficiently use libavcodec running on the TriMedias, without having to port the applications themselves to the TriMedia.

Laboratory : Computer Engineering
Codenumber : CE-MS-2005-05

Committee Members :

Advisor: Ben H.H. Juurlink, CE, TU Delft

Advisor: Paul Stravers, IT, Philips Research Eindhoven

Advisor: Andrei Terechko, IT, Philips Research Eindhoven

Chairperson: Stamatis Vassiliadis, CE, TU Delft

Member: Georgi Gaydadjiev, CE, TU Delft

Member: Frans Ververs, SE, TU Delft


Contents

List of Figures vii

List of Tables ix

Acknowledgements xi

1 Introduction 1
1.1 Problem Description . . . . . . . . . . 1
1.2 Work Structure . . . . . . . . . . 2
1.3 Work Organisation . . . . . . . . . . 2

2 Related Work 5

3 Background 7
3.1 libavcodec and FFmpeg . . . . . . . . . . 7
3.1.1 FFmpeg project . . . . . . . . . . 7
3.1.2 libavcodec library . . . . . . . . . . 8

3.2 Philips TriMedia . . . . . . . . . . 9
3.3 Philips Wasabi . . . . . . . . . . 12

4 Porting FFmpeg to the TriMedia 15
4.1 The C Language Supported by the TriMedia Compiler . . . . . . . . . . 15
4.2 The C Language Used in the FFmpeg Project . . . . . . . . . . 15
4.3 Unsupported Library Functions and Type Declarations . . . . . . . . . . 17
4.4 Operating System Specific Features . . . . . . . . . . 17
4.5 Other Porting Issues . . . . . . . . . . 17
4.6 Conclusions . . . . . . . . . . 18

5 Optimisation of Libavcodec for the TriMedia 19
5.1 Optimisation for the TriMedia Architecture . . . . . . . . . . 19
5.2 Which Parts to Optimise? . . . . . . . . . . 19
5.3 Optimisation Methods . . . . . . . . . . 20
5.3.1 TriMedia Custom Operations . . . . . . . . . . 20
5.3.2 Algebraic simplifications . . . . . . . . . . 22
5.3.3 Grafting and Loop Unrolling . . . . . . . . . . 22
5.3.4 Restricted pointers . . . . . . . . . . 23
5.3.5 Function Inlining . . . . . . . . . . 23
5.3.6 Branch Elimination . . . . . . . . . . 24
5.3.7 Automatic Compiler Optimisation . . . . . . . . . . 24

5.4 Automatic Transformation of MMX to TriMedia Code . . . . . . . . . . . 25


5.4.1 MMX to TriMedia Tool . . . . . . . . . . 25
5.4.2 Using MMX to TriMedia Tool for Libavcodec . . . . . . . . . . 25
5.4.3 Results and Conclusions . . . . . . . . . . 26
5.5 Experimental Results . . . . . . . . . . 28
5.5.1 Performance Measurement . . . . . . . . . . 28
5.5.2 Simulation Tool and Benchmark . . . . . . . . . . 28
5.5.3 Input Video Files . . . . . . . . . . 28
5.5.4 Results . . . . . . . . . . 29

6 Parallelisation of Libavcodec for Wasabi 33
6.1 Parallelisation of the Original Libavcodec . . . . . . . . . . 33
6.2 Parallelisation of Libavcodec for Wasabi . . . . . . . . . . 33
6.3 Parallelisation Results . . . . . . . . . . 34
6.4 Conclusion . . . . . . . . . . 36

7 Interfacing Libavcodec Running on TriMedia with Applications on a General Purpose Processor 39
7.1 Introduction . . . . . . . . . . 39
7.2 System Structure . . . . . . . . . . 40
7.3 “Wrapper” Synchronisation . . . . . . . . . . 41
7.4 Data Transfer Between GPP and TriMedia . . . . . . . . . . 42
7.4.1 Data to Transfer . . . . . . . . . . 42
7.4.2 Memory Allocation . . . . . . . . . . 45
7.5 Conclusion . . . . . . . . . . 47

8 Conclusions and Future Work 49
8.1 Summary . . . . . . . . . . 49
8.2 Conclusions and Recommendations . . . . . . . . . . 50
8.3 Future Work . . . . . . . . . . 51

Bibliography 54

A libavcodec and libavformat API 55
A.1 libavcodec API . . . . . . . . . . 55
A.1.1 void av_register_all(void) . . . . . . . . . . 55
A.1.2 AVCodec *avcodec_find_decoder(enum CodecID id) . . . . . . . . . . 55
A.1.3 AVCodec *avcodec_find_decoder_by_name(const char *name) . . . . . . . . . . 55
A.1.4 AVCodec *avcodec_find_encoder(enum CodecID id) . . . . . . . . . . 55
A.1.5 AVCodec *avcodec_find_encoder_by_name(const char *name) . . . . . . . . . . 55
A.1.6 int avcodec_open(AVCodecContext *avctx, AVCodec *codec) . . . . . . . . . . 56
A.1.7 AVFrame *avcodec_alloc_frame(void) . . . . . . . . . . 56
A.1.8 AVCodecContext *avcodec_alloc_context(void) . . . . . . . . . . 56
A.1.9 int avpicture_get_size(int pix_fmt, int width, int height) . . . . . . . . . . 56
A.1.10 int avpicture_fill(AVPicture *picture, uint8_t *ptr, int pix_fmt, int width, int height) . . . . . . . . . . 56


A.1.11 int img_convert(AVPicture *dst, int dst_pix_fmt, const AVPicture *src, int src_pix_fmt, int src_width, int src_height) . . . . . . . . . . 56
A.1.12 void av_free(void *ptr) . . . . . . . . . . 56
A.1.13 int avcodec_close(AVCodecContext *avctx) . . . . . . . . . . 56
A.1.14 int avcodec_decode_video(AVCodecContext *avctx, AVFrame *picture, int *got_picture_ptr, uint8_t *buf, int buf_size) . . . . . . . . . . 57
A.1.15 int avcodec_encode_video(AVCodecContext *avctx, uint8_t *buf, int buf_size, const AVFrame *pict) . . . . . . . . . . 57
A.1.16 int avcodec_decode_audio(AVCodecContext *avctx, int16_t *samples, int *frame_size_ptr, uint8_t *buf, int buf_size) . . . . . . . . . . 57
A.1.17 int avcodec_encode_audio(AVCodecContext *avctx, uint8_t *buf, int buf_size, const short *samples) . . . . . . . . . . 57
A.2 libavformat API . . . . . . . . . . 57

A.2.1 int av_open_input_file(AVFormatContext **ic_ptr, const char *filename, AVInputFormat *fmt, int buf_size, AVFormatParameters *ap) . . . . . . . . . . 57
A.2.2 void av_close_input_file(AVFormatContext *s) . . . . . . . . . . 58
A.2.3 int av_find_stream_info(AVFormatContext *ic) . . . . . . . . . . 58
A.2.4 void dump_format(AVFormatContext *ic, int index, const char *url, int is_output) . . . . . . . . . . 58
A.2.5 int av_read_frame(AVFormatContext *s, AVPacket *pkt) . . . . . . . . . . 58
A.2.6 int av_write_header(AVFormatContext *s) . . . . . . . . . . 58

A.3 Several techniques . . . . . . . . . . 59
A.3.1 Find the first video stream . . . . . . . . . . 59
A.3.2 Get a pointer to the codec context for the video stream . . . . . . . . . . 59
A.3.3 Find the decoder for the video stream . . . . . . . . . . 59
A.3.4 Find the MP2 encoder . . . . . . . . . . 59

A.4 Get more information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


List of Figures

3.1 TM1100 block diagram (taken from [17]) . . . . . . . . . . 10
3.2 TriMedia instruction with 5 operations . . . . . . . . . . 11
3.3 ifir16 TriMedia custom operation . . . . . . . . . . 11
3.4 Wasabi scheme . . . . . . . . . . 12

5.1 pack16lsb TriMedia custom operation . . . . . . . . . . 21
5.2 GNU inline assembly to macros conversion steps . . . . . . . . . . 26
5.3 Code transformation steps (from left to right) taken to convert GNU inline assembly to macros: the GNU inline assembly is transformed to Intel inline assembly, then the latter is converted to the macro form . . . . . . . . . . 27

5.4 Performance before and after optimisation . . . . . . . . . . . . . . . . . . 30

6.1 Time-line illustrating how the libavcodec video encoder is multithreaded . . . . . . . . . . 34
6.2 Speedup as a function of the number of CPUs . . . . . . . . . . 35
6.3 Thread activity during the encoding of one video frame on 8 processors (the image is provided by the Wasabi performance analysis tool TimeDoctor) . . . . . . . . . . 36

7.1 Mapping of tasks on processors in Wasabi chip . . . . . . . . . . 39
7.2 Libraries – “wrappers” . . . . . . . . . . 40
7.3 GPP-side “wrapper” structure . . . . . . . . . . 45


List of Tables

3.1 libavcodec versus XviD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5.1 List of video files used for performance analysis. All mp2 audios are 44100 Hz, stereo, 64 kb/s; all mp3 audios are 48000 Hz, stereo, 128 kb/s. The MPEG4 files were encoded with DivX . . . . . . . . . . 29

5.2 Performance before and after optimisation . . . . . . . . . . 29
5.3 Performance gain of IDCT algorithm’s optimisation without forcing inlining . . . . . . . . . . 31
5.4 Performance gain of IDCT algorithm’s optimisation applied together with forcing inlining . . . . . . . . . . 31

6.1 Number of cycles and frequency required to encode video and audio for different numbers of processors . . . . . . . . . . 35


Acknowledgements

I would like to thank Ben Juurlink, my advisor at TU Delft; the whole project came true because of his help. He also helped me very much by reviewing the report: the contents, the structure, and the English language have improved greatly thanks to him.

Thanks to everybody who supported me in my work and in private life while I was doing the project in Eindhoven: Andrei Terechko, who was one of my two advisers at Philips and who spent very much of his own time helping me; Paul Stravers, who organised the project; and all my friends – colleagues and roommates – with whom I could both discuss professional topics and spend free time in the most pleasant way.

Thanks to Philips for giving me the opportunity to do this research project.

Thanks to my family and all friends in Uzbekistan, Russia, Delft, and all over the world who provided me with long-distance support.

Demid Borodin
Delft, The Netherlands
July 9, 2005


1 Introduction

In this chapter the motivation and context of the research project are presented.

1.1 Problem Description

The demand for computational power in the field of digital multimedia increases all the time. The power of dedicated multimedia processors working alone has become insufficient. Currently Philips is considering utilising parallel multiprocessor architectures for multimedia processing. The advantages of dedicated multimedia processors such as the Philips TriMedia, combined with parallelisation, should open new possibilities for computationally intensive multimedia tasks.

TriMedia [17] is a very long instruction word (VLIW) processor with a large number of custom operations dedicated to common multimedia operations. It does not have a high clock frequency since it is targeted at low-cost devices, but when it is programmed in the right way, it can perform multimedia tasks very quickly, at a speed comparable to that of powerful general-purpose processors. The key reasons for this are its VLIW architecture and the multimedia custom operations implemented in hardware. The TriMedia can perform much more work in one cycle than a general-purpose processor.

Philips Wasabi, an instance of the CAKE architecture [27, 29], is a multimedia multiprocessor chip currently being developed at Philips Research. It creates a new opportunity for optimisation: parallelisation of applications. Wasabi consists of several TriMedia DSPs and one or more general-purpose CPUs that communicate through a shared level-2 cache.

Libavcodec [5] is an open source software library that contains many different audio/video coders/decoders (codecs) and is known to be a very fast MPEG4 codec [31, 14, 3]. It has been optimised for several multimedia extensions such as Intel’s MMX, AMD’s 3DNow, etc. Besides its native project FFmpeg, many other projects utilise libavcodec.

The goals of this project are (1) to port libavcodec to the TriMedia, (2) to improve performance by applying architecture-specific optimisations, (3) to parallelise libavcodec to exploit multiple processors, and (4) to provide an interface between the TriMedia processor(s) executing libavcodec and the general-purpose processor(s) running application(s) that use libavcodec.

Libavcodec supports only the most recent gcc compilers, whereas the TriMedia compiler accepts standard ANSI C. Hence, to port libavcodec to the TriMedia, certain C constructs had to be replaced by others and certain library functions had to be implemented.


Many techniques exist for optimising programs for the TriMedia. The main idea behind them is to utilise the instruction-level and data-level parallelism available due to the special features of the architecture. This can be done by changing the code manually, or by restructuring the code in such a way that the compiler can schedule it more efficiently. Among the optimisations that have been applied are conventional optimisation techniques such as algebraic simplifications and function inlining, techniques to increase instruction-level parallelism such as branch elimination, and restricted pointers to inform the compiler that there is no aliasing. In addition, the TriMedia custom operations were used. Among these custom operations there are SIMD (single-instruction, multiple-data) operations that process several short data types simultaneously.

After optimising the library for the TriMedia, it was parallelised for Wasabi. Because the original version had already been parallelised using the Windows multithreading API and POSIX Pthreads, parallelisation was performed by changing the interface to the TriMedia multithreading API TM OSAL.

Finally, an interface between the TriMedia(s) executing libavcodec and the general-purpose processor(s) running application(s) that use libavcodec is proposed. This interface enables applications to use libavcodec running on the TriMedia, without having to port the applications themselves to the TriMedia.

1.2 Work Structure

To achieve the given objectives, the following main steps have been taken in succession:

1. Porting libavcodec/FFmpeg to the Philips TriMedia. FFmpeg exploits many features of the most recent GNU C compiler (gcc) that are not supported by the TriMedia compiler. In order to port FFmpeg to the TriMedia, a number of changes had to be made.

2. Optimisation of libavcodec for the TriMedia. Some architecture-specific optimisations have been applied to libavcodec to achieve greater processing speed. The optimisations made in this work mostly target video decoding.

3. Parallelisation of libavcodec for Wasabi. Originally, video encoding in libavcodec was parallelised with pthreads and Windows threads. Support for the TM OSAL interface has been added.

4. Designing an interface between applications running on Wasabi's general-purpose processor and libavcodec running on the TriMedia processor(s).

1.3 Work Organisation

This report is organised as follows.

Chapter 2 presents related work.


Chapter 3 provides the background information needed for understanding the work. It describes libavcodec and FFmpeg, the TriMedia, and the Wasabi chip multiprocessor.

Chapter 4 explains how libavcodec/FFmpeg was ported to the TriMedia.

Chapter 5 describes how libavcodec was optimised for the TriMedia. Different optimisation approaches are listed; after that, the results of applying the optimisations to libavcodec are presented. This chapter considers only the optimisations for the TriMedia, without parallelisation for Wasabi.

Chapter 6 describes how libavcodec was parallelised for the Wasabi multiprocessor and presents the results of this parallelisation.

Chapter 7 proposes a way to interface applications running under Linux on a general-purpose processor with libavcodec running on the TriMedia processor(s).

Chapter 8 draws conclusions from this work and provides suggestions for future work.


2 Related Work

This chapter gives an overview of previous work related to software optimisation for the Philips TriMedia and Wasabi. It describes the valuable information that can be found in the TriMedia documentation, as well as work on application optimisation and parallelisation for the TriMedia and Wasabi.

The documentation of the TriMedia provides a good overview of software optimisation methods for the TriMedia.

Chapter 4, “Porting and Optimising Programs”, of the TriMedia Cookbook [18] describes what should be taken into account when porting applications to the TriMedia, and how to compile and optimise them. Chapter 11, “Optimising for TriMedia”, also discusses software optimisation for the TriMedia, but is mostly focused on compiler issues, automatic optimisations, and manual methods to improve the gain of automatic optimisations. Chapter 6, “Case Studies”, presents many optimisation methods with examples. Chapter 10, “Performance Analysis Overview”, explains the performance analysis tools provided in the TriMedia Compilation System.

Information about the TriMedia custom operations, a short list of them, and several examples are given in [17].

R. Jochemsen [12] has worked on mapping the H.264 main profile video decoder to Wasabi, aiming at real-time high-definition H.264 decoding. The work shows what kind of programming issues are encountered during the mapping of a modern multimedia algorithm onto a multimedia processor. It is basically focused on optimisation for a single TriMedia processor and does not deal with parallelisation; on the contrary, it explicitly states that the optimisation work does not take into account that it has to deal with a multiprocessor. The work presents an investigation of the possibilities to optimise several of the most time-consuming functions of the H.264 decoder. The author started with a slightly optimised, publicly available H.264 reference software implementation. According to the presented calculations, which assume good parallelisation, achieving real-time decoding on Wasabi with 8 TriMedia CPUs required an overall performance improvement by a factor of 4. The 4 top functions were optimised with improvement ratios from 1.2 to 5.1. The author concludes that much more code needs optimisation and that reaching the goal (real-time high-definition H.264 decoding on Wasabi) will be very hard or even impossible.

Li et al. [13] have mapped two 3D-TV rendering algorithms onto Wasabi. The goal was to achieve real-time execution of the algorithms on a Wasabi chip and, if possible, to minimise the number of TriMedia CPUs occupied with this task, in order to leave some of them available for other tasks. The sequential C implementations of the algorithms were optimised for the TriMedia core and, after that, parallelised for Wasabi. The first of the two algorithms under consideration could be run on a single TriMedia processor in real time, but had the disadvantage that the viewer had to use a pair of special glasses to enjoy the 3D effect. The second algorithm did not require the glasses, but could be run in real time only on 16 TriMedia processors. Both algorithms scale well with the number of processors.

The spacecake team [28] describes how the software MPEG2 decoder obtained from [3] was parallelised for Wasabi. The authors decided to decode every slice of a frame in a different thread; the scheduler in Wasabi distributes these threads over the processors. With a small effort (less than a person-week) the authors transformed the sequential decoder into a multi-threaded program. The speedup achieved was almost linear for up to 8 TriMedia processors. From 12 processors onwards, increasing the number of processors did not reduce the execution time substantially.

3 Background

This chapter is dedicated to the topics needed to understand the work. It describes libavcodec and FFmpeg, explains the relationship between them, and compares libavcodec to another MPEG4 codec (XviD). Furthermore, some information about the Philips TriMedia and Wasabi is given.

3.1 libavcodec and FFmpeg

3.1.1 FFmpeg project

FFmpeg [5] is an open source project for digital audio and video processing. It is a complete solution to record, convert, and stream digital audio and video.

The FFmpeg project consists of three programs:

• ffmpeg : a command line tool to convert one video file format to another. It also supports grabbing and encoding in real time from a TV card.

• ffplay : a media player based on the main FFmpeg libraries.

• ffserver : an HTTP multimedia streaming server for live broadcasts.

FFmpeg contains two libraries: libavcodec and libavformat. The former is a set of all audio/video codecs FFmpeg can employ; the latter is a set of parsers and generators for common audio/video file formats. The command

ffmpeg -formats

can be used to list the file formats and codecs supported by libavformat and libavcodec.

Further on in this work, only the ffmpeg command line tool is meant by “FFmpeg”. ffplay and ffserver are of no interest here since the target is actually libavcodec, which is best tested with the ffmpeg tool. The ffmpeg command line tool (and the libavformat library) are used only as a software tool utilising the libavcodec library. Since grabbing a live audio/video source is an operating-system-specific task not related to libavcodec, this work is focused on the converting (encoding/decoding of audio/video) capability of FFmpeg only.

FFmpeg has an intuitive command line interface in the sense that parameters not specified by the user are inferred automatically. For example, the command

ffmpeg -i in.avi out.mpg


converts the file in.avi to the file out.mpg. The libavformat library, while reading the input file, automatically determines the codec that libavcodec needs to apply to convert the input data to the internal (raw) audio/video format. The output codec and its parameters are inferred from the output file's extension. For the mpg extension the inferred parameters are: the MPEG1 video codec (and its internal settings), a 200 kb/s bit rate, and the same resolution and frame rate as in the input file. Then libavcodec converts the audio/video data to the required format and libavformat writes the file.

Another example of an ffmpeg command:

ffmpeg -s 720x480 -i in.yuv -i in.wav -b 800 -vcodec mpeg4 out.avi

In this example, some desired output parameters are passed to FFmpeg. Two input files, in.yuv with raw video data at a resolution of 720×480 and in.wav with raw audio data, will be combined and written to the file out.avi. The video bit rate is set to 800 kb/s and the MPEG4 codec is forced.

These are just two examples; FFmpeg has many more features. For a discussion of all features, the reader is referred to the project homepage or its man page.

Originally written for Linux, FFmpeg can also be compiled for several other operating systems, including Microsoft Windows.

3.1.2 libavcodec library

libavcodec is an audio/video codec library. It supports a large number of different codecs. libavcodec is used in many other projects besides FFmpeg, for example in MPlayer [4]. A list of the projects using libavcodec can be found at [1].

libavcodec is widely known to be a very fast library, one of the quickest MPEG4 codecs. On the other hand, in the thorough codec quality comparison in [7], where libavcodec is represented as ffvfw, the speed of ffvfw does not impress: for example, in two-pass encoding ffvfw was more than 3 times slower than XviD.

The author of the current work performed some rough experiments to compare XviD and libavcodec and obtained results completely different from those in [7]. The reason for this is probably the environment used: in [7] all experiments were performed on a Windows system, and libavcodec was part of another project (ffvfw). In the current work, the experiments were performed on Linux, since both libavcodec and XviD are Linux-native libraries. The same application, mencoder (the encoder of the MPlayer project), was used as an interface to the original codec libraries, so the libraries were affected by the application in the same way.

Both codecs were asked to encode the video at a bit rate of 4000 kb/s (note: here 1 kb = 1000 bits), with at most one B-frame in a row, using MPEG quantisation (according to MPlayer's man page, MPEG quantisation preserves more detail than the H.263 one at high bit rates). The other parameters were kept at their defaults. The encoding was done in both 1-pass and 2-pass mode. The following command was used for 1-pass encoding with libavcodec:

time mencoder infile.mpeg -ovc lavc -oac copy -lavcopts vbitrate=4000:vmax_b_frames=1:mpeg_quant -o outfile_lavc_1pass.avi

for XviD:

time mencoder infile.mpeg -ovc xvid -oac copy -xvidencopts bitrate=4000:max_bframes=1:quant_type=mpeg -o outfile_xvid_1pass.avi

The time utility determined the exact processing time. The results are presented in Table 3.1.

Codec       # passes  Processing time  Encoded file size (KB)
libavcodec  1         3 min 58.58 s    162 792
XviD        1         12 min 43.72 s   163 133
libavcodec  2         6 min 42.20 s    163 621
XviD        2         12 min 51.25 s   163 377

Table 3.1: libavcodec versus XviD.

In 1-pass mode libavcodec performs almost 4 times as fast as XviD; in 2-pass mode it is about twice as fast. The sizes of the encoded movies are approximately the same. The author could hardly see any difference in the quality of the encoded movies, but XviD seems to do a little better in this respect. The interested reader is referred to [7] for a thorough quality comparison of the video delivered by these codecs, which leads to the conclusion that XviD is considerably better. Hence libavcodec trades some quality for a great advantage in speed.

The most computationally intensive parts of libavcodec have been optimised for Intel's MMX [16], AMD's 3DNow [8], and other dedicated multimedia architectures. One of the goals of this project is to add support for the TriMedia.

3.2 Philips TriMedia

Philips TriMedia¹ [17] is a media processor targeted at multimedia applications dealing with high-quality audio and video. These applications can range from low-cost, dedicated systems such as video phones or set-top boxes to re-programmable, multi-purpose plug-in cards for personal computers. It can function, for example, as a video compression/decompression engine on a PCI card in a PC. In other words, the TriMedia combines the advantages of a special-purpose, low-cost embedded solution and a general-purpose, programmable processor.

The TriMedia has a powerful general-purpose VLIW (very long instruction word) processor core (the DSPCPU). A small real-time operating system, driven by interrupts from other units, runs on the DSPCPU and coordinates all on-chip activities. Several DMA-driven multimedia input/output units and co-processors operate independently on the TriMedia chip. A high-performance bus and memory system provide communication between the processing units. Figure 3.1 depicts a block diagram of the TM1100 chip.

¹The description is based on the TM1100 processor.

Figure 3.1: TM1100 block diagram (taken from [17])

The DSPCPU and peripherals are time-shared, and the communication between units is performed through SDRAM memory. The DSPCPU switches from one task to another, issuing commands to the peripheral functional units to control their work.

The 32-bit DSPCPU has a 32-bit linear address space and 128 general-purpose 32-bit registers. The DSPCPU core has split 16 KB data and 32 KB instruction caches. The data cache is dual-ported to allow two simultaneous accesses, and both caches are eight-way set-associative with a 64-byte block size.

The TriMedia instruction set includes all the traditional RISC microprocessor operations. In addition, it supports special-purpose multimedia operations that can accelerate standard audio/video algorithms dramatically.

TriMedia operations are guarded. That is, the execution of an operation depends upon the content of a general-purpose register. If the register contains zero, the operation behaves as a NOP (no-operation); if it contains one, the operation executes. This technique allows the compiler to exploit more instruction-level parallelism [10].

The key approach taken in the TriMedia to achieve the highest possible performance is the VLIW architecture, which maximises processor throughput at the lowest possible cost. Compared to a superscalar architecture, a VLIW architecture has lower cost because of simpler hardware logic. In superscalar architectures, parallelism must be discovered by the hardware dynamically at runtime, while in a VLIW architecture the instruction-level parallelism is explicit. This advantage comes at the expense of more complex compiler support.

The TriMedia VLIW processor can issue 5 operations in one instruction (one instruction per clock cycle). In turn, each operation may perform several arithmetic operations. In other words, more than 5 RISC-like operations can be executed in one clock cycle on the TriMedia (Figure 3.2). Just one of the five operations in a TriMedia VLIW instruction can implement up to 11 traditional microprocessor operations. These features of the TriMedia, if properly used, result in tremendous throughput for multimedia applications.

Figure 3.2: TriMedia instruction with 5 operations

An example of a custom operation performing several arithmetic operations is ifir16, which computes a sum of products of signed 16-bit halfwords (see Figure 3.3):

ifir16(a, b) = a_hi × b_hi + a_lo × b_lo

Figure 3.3: ifir16 TriMedia custom operation

The latency of the ifir16 operation is three clock cycles, the same as that of the imul operation (32-bit integer multiplication).
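For illustration, the semantics of ifir16 can be modelled in portable C. The function model_ifir16 below is our name for this sketch, not a TriMedia intrinsic:

```c
#include <stdint.h>

/* Model of the TriMedia ifir16 custom operation: treat each 32-bit
 * operand as two signed 16-bit halfwords and return the sum of the
 * pairwise products. model_ifir16 is an illustrative name of ours. */
static int32_t model_ifir16(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFF);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFF);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}
```

Note that the halfwords are interpreted as signed, so a high halfword of 0xFFFE contributes −2 to the sum of products.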

However, the ifir16 operation, like most other operations, cannot use just any of the five issue slots in a TriMedia instruction. The reason is that each functional unit is assigned to specific issue slots in a VLIW instruction. For example, the ifir16 and imul operations can use only the second and third issue slots. There are also other constraints on the operations that can be packed into one instruction. For example, at most two load/store operations are allowed in a single instruction.

Because the number of available functional units is limited and the assignment of operations to instruction slots is restricted, some operations cannot be executed as soon as their operands are available. If some operations have to wait until the necessary slots become available while other slots are idle, and the scheduler cannot rearrange the instruction order because of data dependencies, not all issue slots are in use. This problem is very typical for unoptimised applications on the TriMedia, and good use of the issue slots is one of the purposes of software code optimisation. Although it is the task of the scheduler to pack the operations properly, a programmer has to take all these details into account to create code that utilises the power of the processor more or less completely.

3.3 Philips Wasabi

Wasabi is an instance of the CAKE architecture [27, 29], a multimedia chip currently being developed at Philips Research. It is a multiprocessor consisting of several processors (8 are currently targeted), each with its own level-1 cache, and a shared level-2 cache (see Figure 3.4). The Wasabi chip contains fast message-passing channels and some application-specific co-processors. Memory access is governed by a cache coherency protocol. Wasabi addresses the future multimedia market, aiming at real-time processing of high-definition video and other computationally intensive tasks that cannot be handled by currently existing multimedia processors.

Figure 3.4: Wasabi scheme.

Wasabi aims to provide an industry-standard programming model with a single shared address space, cache coherence implemented in hardware, and high-speed synchronisation primitives.

Because the caches are kept coherent and thread scheduling is controlled by the operating system, a programmer can view a cluster of TriMedia processors as a single core. If a user wants to employ multiple processors, he or she should provide multiple parallel tasks (threads). The question to be answered by the programmer is what to parallelise and how. It does not make sense to create many fine-grain tasks, because the communication overhead may exceed the benefits achieved by parallelisation. Instead, it makes sense to divide the work into coarse-grain tasks, one for each processing unit.

There are also different parallelisation approaches: code and data parallelisation. With code parallelisation, different tasks (algorithms) are assigned to different processors. Data parallelisation means that all processors run the same algorithm but process different data (for example, in video processing the same function can be run on all processors, while each processor works on its own slice of a video frame). The choice between code and data parallelisation depends on the application.
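As a sketch of the data-parallel approach (using POSIX threads as a stand-in for the TM OSAL API; the frame dimensions, the names, and the toy inversion filter are all ours, chosen only for illustration), each thread processes its own slice of a frame:

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical data-parallel sketch: one thread per slice of a frame.
 * All names (slice_t, filter_slice, process_frame) are illustrative. */
#define FRAME_H 480
#define FRAME_W 720
#define NTHREADS 4

typedef struct { uint8_t *pix; int first_row, last_row; } slice_t;

static uint8_t frame[FRAME_H][FRAME_W];

static void *filter_slice(void *arg)
{
    slice_t *s = arg;
    for (int y = s->first_row; y < s->last_row; y++)
        for (int x = 0; x < FRAME_W; x++)
            s->pix[y * FRAME_W + x] = 255 - s->pix[y * FRAME_W + x]; /* toy filter */
    return NULL;
}

void process_frame(void)
{
    pthread_t tid[NTHREADS];
    slice_t s[NTHREADS];
    int rows = FRAME_H / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        s[i].pix = &frame[0][0];
        s[i].first_row = i * rows;
        s[i].last_row = (i == NTHREADS - 1) ? FRAME_H : (i + 1) * rows;
        pthread_create(&tid[i], NULL, filter_slice, &s[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
}
```

The slices are disjoint, so no synchronisation beyond the final join is needed; this is exactly why coarse-grain data parallelism keeps communication overhead low.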

Wasabi combines the power of the TriMedia (instruction-level parallelism) and thatof a parallel multiprocessor architecture (thread-level parallelism).

4 Porting FFmpeg to the TriMedia

In this chapter we describe the difficulties encountered when porting FFmpeg to the TriMedia. In Section 4.1 we describe the C language supported by the TriMedia compiler. Section 4.2 contains a description of the C language used in the FFmpeg project and the changes that were made to port it to the TriMedia. Section 4.3 describes the library functions used in FFmpeg but not supported by the TriMedia compilation system. Several other porting issues are discussed thereafter.

4.1 The C Language Supported by the TriMedia Compiler

As stated in [19], the C programming language accepted by the TriMedia compiler tmcc is based on the following standards:

• American National Standard for Programming Languages–C, ANS X3.159–1989

• ISO/IEC 9899:1990

• Technical Corrigendum 1 (1994) to ISO/IEC 9899:1990

• IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE 754-1985

“ANS X3.159–1989” is the basis for the later ISO standard “ISO/IEC 9899:1990,” which includes the prior standard entirely, with only minor editorial changes. In the remainder of the text, “ANSI C” or “Standard C” refers to this ISO C.

The TriMedia compiler also provides extensions to ANSI C. These extensions are either likely to be approved in future standard revisions or in common use in other ANSI C compilers. Language extensions that conflict with the ISO C standard are controlled by compile-time switches, allowing verification of compliance with the standard.

In addition, the compiler supports the concept of restricted pointers, as proposed by the Numerical C Extensions Group in the proposal X3J11/95-049, WG 14/N448 [11].

4.2 The C Language Used in the FFmpeg Project

FFmpeg is written for the most recent GNU C compilers and utilises many GNU-specific features that are not supported by the TriMedia compiler. Several examples are given below.

• Structure assignments. Given a structure definition such as

typedef struct {
    char ch;
    int i;
    float f;
} struct_type;

the structure assignment

struct_type struct_name;
struct_name = (struct_type) { 'a', 50, 0.5 };

must be replaced by

struct_type struct_name;
struct_name.ch = 'a';
struct_name.i = 50;
struct_name.f = 0.5;

• Structure initialisation. For the same structure type as in the previous example, the initialisation

struct_type struct_name = { .f=0.5, .i=50 };

should be replaced by

struct_type struct_name = { 0, 50, 0.5 };

in the source code for the TriMedia compiler.

• Declaration of arrays with non-constant dimensions. A declaration such as

int int_array[w]; /* w is not a constant */

should be replaced by a function call allocating memory dynamically:

int *int_array = (int*) malloc( w * sizeof(int) );

• The attributes of the GNU C compiler are not supported by the TriMedia compiler and should be carefully eliminated. For example, the attribute

__attribute__((unused))

can simply be removed, since it is only used to tell the compiler that no warning should be given if the variable is unused. Normally the compiler would warn about an unused variable.

4.3 Unsupported Library Functions and Type Declarations

Some C library type declarations and functions used in the FFmpeg project are not supported by the TriMedia Compilation System (TCS).

To solve the problem of unsupported declarations, the directory tm_include was added to the FFmpeg project. The directory contains include files, sometimes with minor changes, taken from the standard Linux include directories.

Besides that, several functions used in FFmpeg are not present in the TCS and had to be implemented. They are the following:

• int strcasecmp(const char* s1, const char* s2); – case-insensitive comparison of two strings.

• int snprintf(char* str, size_t size, const char* format, ...); – prints to the string str at most size characters, based on the format specified in format.

• int vsnprintf(char*, size_t, const char*, va_list ap); – similar to snprintf, but called with a va_list instead of a variable number of arguments.

• double rint(double x); – rounds the argument to an integer value in floating-point format.

• int gettimeofday(struct timeval *, struct timezone *); – gets the current time.

These functions were implemented in missed_in_tm.c, a new file created in the FFmpeg root directory. Due to project time limitations, not all of them were implemented completely.

Instead, simplified versions were implemented based on some assumptions, and tests have shown that the assumptions are safe for FFmpeg. The functions snprintf and vsnprintf simply call the standard functions sprintf and vsprintf, respectively, which are available in the TCS standard library. The parameter size_t size, which limits the maximum length of the produced string, is ignored in both cases; this turned out to be safe for FFmpeg. The function gettimeofday returns only the number of seconds since the Epoch, because the number of milliseconds could not be determined from standard TriMedia functions.

4.4 Operating System Specific Features

Some operating system (primarily Linux) specific features are used in FFmpeg; audio/video grabbing is an example. These features were disregarded, since they are outside the scope of the project.

4.5 Other Porting Issues

When porting FFmpeg to the Philips TriMedia, the following aspects had to be taken into account:

• The compiler had to be switched to Little Endian byte ordering, since the default is Big Endian.

• The TriMedia hardware does not support double-precision (64-bit) floating point. As FFmpeg/libavcodec needs it, the compiler was switched to double-precision floating-point simulation mode using the -fp64 compiler command-line option. Unfortunately, this has a negative effect on performance, since the simulation is done in software.

• The first versions of the TriMedia do not support unaligned load and store operations. As these are used in libavcodec, the environment had to be changed to the TriMedia 2270. That is, TCS 4.5 was used instead of TCS 4.2, which did not support the TriMedia 2270 architecture. The compiler command-line option -target tm2270Minimal is used to set the target architecture, and the employed simulator is tm2270sim instead of tmsim.

After the porting was finished, FFmpeg ran on the tm2270sim simulator. The default memory size assumed by the simulator is 8 MB. This is not sufficient for the experiments, so the simulator command-line option -memorysize 128000000 was used, instructing it to assume 128000000 bytes of memory.

4.6 Conclusions

In this chapter we have described the difficulties encountered when porting FFmpeg to the TriMedia. First, since the C language used in FFmpeg utilises many of the latest features of the GNU C compiler, some parts of the code were edited to conform to the C language accepted by the TriMedia compiler. Second, FFmpeg uses some data types and functions from the C library that are not supported by the TriMedia Compilation System. To solve this, some header files were copied from the standard Linux include directories, and several functions were implemented in the file missed_in_tm.c. Third, FFmpeg contained some operating system specific features that had to be safely eliminated. Finally, we described several other compiler issues and the way to run FFmpeg on the TriMedia simulator.

5 Optimisation of Libavcodec for the TriMedia

In this chapter we describe what has been optimised in libavcodec for the TriMedia architecture, the optimisation methods used, and the results of the optimisation.

5.1 Optimisation for the TriMedia Architecture

The measure of a program's performance used in this work is the number of clock cycles required. The goal is to minimise the execution time of a program, i.e., to minimise the number of clock cycles as much as possible.

Taking into account the VLIW architecture of the target platform, the ideal case is when the processor is saturated. That is, only the processor's computing resources (the number and configuration of the available functional units) limit the performance of an application. This means that the level of instruction-level parallelism (ILP) is so high that the TriMedia executes 5 useful operations every clock cycle.

Unfortunately, in an unoptimised application there are many factors that affect the ILP negatively. Dependencies between operations force the compiler to schedule many NOPs (No OPerations) instead of useful operations. A particular problem for a TriMedia program is branches. The compiled code is organised in so-called decision trees, and if there is not enough plain code between branches, a large part of the decision trees can consist of NOPs only, introducing many “wasted” clock cycles.

The TriMedia compiler can schedule plain code without branches and dependencies most efficiently. The goal of manual code optimisation is, therefore, to eliminate branches and dependencies as much as possible.

5.2 Which Parts to Optimise?

Since libavcodec is a very large project, it is impossible to optimise it completely in the given time period. Some parts to optimise have to be chosen. The question is how to choose these parts so that the changes affect the speed of the overall program the most.

The TriMedia simulator (tm2270sim) passes information to the TriMedia profiler (tmprof) in the form of a statistics file, which the profiler processes to give information on the decision trees and functions in the program: the number of executions, total cycles, instruction/data cache stall cycles, etc.

The profiler was used to find the most expensive functions of libavcodec in terms of execution time. These are the most computationally intensive functions of the codecs, for example the functions implementing the IDCT (Inverse Discrete Cosine Transform) algorithm.

The experiments were performed with input media files encoded by different codecs. The most computationally intensive functions common to the largest number of the most important codecs (such as MPEG2 and MPEG4 [31, 14, 3]) were chosen for optimisation. For example, the function idctSparseColPut is one of the most time-consuming functions in the decoding process of the MPEG2, MPEG4 and MJpeg codecs. In the MPEG2 decoding process this function took 9.14% of the application's total execution cycles. Several other functions belonging to the implementation of the IDCT algorithm had results close to idctSparseColPut.

5.3 Optimisation Methods

The TriMedia Compilation System supports profile-based program optimisation methods. Some techniques, such as grafting, are performed automatically by the compiler, taking into account user directives in the form of grafting parameters. Newer compiler versions also support automatic if-conversion, function inlining, and loop unrolling. If-conversion is a technique used to reduce the number of branches in the code: the compiler converts the code to take advantage of the TriMedia guarded instructions, creating different evaluation paths with appropriate guards (this is referred to as predicated execution). Other techniques, such as manual loop unrolling, the use of restricted pointers, and the use of TriMedia custom operations, currently require user intervention. These techniques and grafting are described in detail below.
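As a source-level sketch of what if-conversion achieves (the compiler performs this on the intermediate code; the example and both function names are ours), a branch can be replaced by computing both paths and selecting the result with the guard value:

```c
/* Before if-conversion: a branch the scheduler must work around. */
int max_branchy(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

/* After if-conversion: both values are computed and the comparison
 * result (0 or 1) selects one, which maps naturally onto the
 * TriMedia guarded operations. */
int max_predicated(int a, int b)
{
    int g = (a > b);            /* guard value: 0 or 1 */
    return g * a + (1 - g) * b; /* no branch */
}
```

Both functions return the same result; the predicated form simply trades a branch for a few extra arithmetic operations that can fill otherwise idle issue slots.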

5.3.1 TriMedia Custom Operations

The TriMedia has a relatively low clock frequency due to its target market: low-cost applications. Earlier versions of the TriMedia ran at about 100 MHz, later versions at about 500 MHz. The VLIW architecture aims to increase performance by introducing instruction-level parallelism in the form of issuing 5 operations every clock cycle.

The TriMedia custom operations aim to increase the parallelism even further, introducing parallelism inside one operation. The idea is that much more work can be done by a custom operation than by a general-purpose one. The custom operations are carefully chosen operations specialised for multimedia algorithms and implemented in hardware, providing a large performance gain.

For many multimedia algorithms, processing data in 32-bit quantities is wasteful, since a significant amount of execution time is spent processing 8-bit or 16-bit data, and using 32-bit operations for such data makes inefficient use of the 32-bit hardware. Among the TriMedia custom operations there are SIMD ones that allow these 32-bit resources to operate on four 8-bit or two 16-bit data items simultaneously, improving performance significantly with only a small increase in implementation cost. In addition to making the best use of standard resources, the TriMedia custom operations sometimes combine several simple operations into one. These combinations are tailored specifically to the needs of important multimedia applications. Some custom operations also eliminate conditional branches, enabling the scheduler to make effective use of all five operation slots in a TriMedia instruction.

The current TriMedia compiler is not able to use the custom operations when compiling, so a programmer has to insert them manually. There is an easy-to-use interface for utilising the custom operations in C code: they are invoked just like regular C functions/macros.

Below is an example of an optimisation based on custom operations applied in libavcodec. It is taken from the function idctSparseColPut, one of the most time-consuming functions in DVD (MPEG2), DivX (MPEG4), and MJpeg decoding:

a0 += W2 * col[8*2];
if (col[8*4])
    a0 += W4 * col[8*4];
if (col[8*6])
    a0 += W6 * col[8*6];

The code presented above has been changed to:

temp = IFIR16( PACK16LSB(W2,W4), PACK16LSB(col[8*2],col[8*4]) );
a0 += IFIR16( PACK16LSB(temp,W6), PACK16LSB(1,col[8*6]) );

The ifir16 and pack16lsb TriMedia custom operations are used here. As illustrated in Figure 5.1, the pack16lsb operation packs the two least-significant halfwords of its input operands and places the result in its destination operand. The ifir16 operation is illustrated in Figure 3.3; it produces the sum of products of the 16-bit halfwords given in the input operands.

Figure 5.1: pack16lsb TriMedia custom operation

IFIR16( PACK16LSB(W2,W4), PACK16LSB(col[8*2],col[8*4]) )

is equivalent to

W2 * col[8*2] + W4 * col[8*4]

Thus, instead of 2 integer multiplications (which take 2 × 3 = 6 clock cycles) and one addition (1 clock cycle), everything is done by 2 packing operations (2 × 1 = 2 clock cycles) and one ifir16 operation (3 clock cycles).
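The transformation can be checked with portable C models of the two custom operations (MODEL_PACK16LSB and model_ifir16 are our names, not TriMedia intrinsics). Note that the check, like the rewrite itself, assumes the intermediate sum temp fits in a signed 16-bit halfword:

```c
#include <stdint.h>

/* Portable models of the two custom operations used above. */
#define MODEL_PACK16LSB(a, b) \
    ((((uint32_t)(a) & 0xFFFF) << 16) | ((uint32_t)(b) & 0xFFFF))

static int32_t model_ifir16(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16)
         + (int32_t)(int16_t)(a & 0xFFFF) * (int16_t)(b & 0xFFFF);
}

/* Check that the rewritten expression equals the original sum of three
 * products, for operands small enough that temp fits in 16 bits. */
int check_rewrite(int W2, int W4, int W6, int c2, int c4, int c6)
{
    int32_t reference = W2 * c2 + W4 * c4 + W6 * c6;
    int32_t temp = model_ifir16(MODEL_PACK16LSB(W2, W4),
                                MODEL_PACK16LSB(c2, c4));
    int32_t a0 = model_ifir16(MODEL_PACK16LSB(temp, W6),
                              MODEL_PACK16LSB(1, c6));
    return a0 == reference;
}
```

The second ifir16 multiplies temp by the constant 1 and W6 by col[8*6], which is how the final addition is folded into the custom operation.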

In the original code given above, the expensive multiplication operation is avoided if an operand is zero. This technique assumes that a multiplication is much more expensive than a comparison and/or conditional branch. For the TriMedia, this technique is not applicable, because the branches can result in a very bad schedule causing a lot of wasted clock cycles. Instead, the branches were eliminated completely without affecting the semantics.

This optimisation was applied to several functions of the IDCT algorithm, for several variables. As a result, the execution time of the function calling them decreased by 17%.

5.3.2 Algebraic simplifications

Some traditional, non-TriMedia-specific optimisation methods were also applied in libavcodec. For example, in the same IDCT functions described above, a simple optimisation was applied that replaces a multiplication operation taking 3 clock cycles by shift and subtraction operations taking 1 clock cycle each:

#define W4 16383

a0 = W4 * temp;

was changed to

a0 = (temp << 14) - temp;

This algebraic transformation is allowed because 16383 = 16384 − 1 = 2^14 − 1.
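The identity behind this strength reduction can be verified directly in C; the two function names below are ours, added only to compare the forms:

```c
/* Strength reduction from the fragment above: W4 * temp, with
 * W4 = 16383 = 2^14 - 1, is replaced by a shift and a subtraction. */
#define W4 16383

int mul_by_w4(int temp)        /* original: one 3-cycle multiplication */
{
    return W4 * temp;
}

int shift_sub_w4(int temp)     /* optimised: two 1-cycle operations */
{
    return (temp << 14) - temp;
}
```

Since (2^14 − 1) · temp = (temp << 14) − temp, both functions agree for any non-negative operand that does not overflow.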

5.3.3 Grafting and Loop Unrolling

Grafting increases parallelism within decision trees. This technique replaces a jump with a copy of the destination tree. Loop unrolling is a specialised version of grafting.

Obviously, grafting and loop unrolling increase the program size. This is a substantial drawback, because increasing the code size generally increases the number of cache misses and hence the number of cache stall cycles. Grafting and loop unrolling are valuable tools for an easy performance increase (since the compiler does some of this work automatically), but there is a trade-off between the code size increase and the cache misses.

According to [18], grafting is not the solution to all performance problems. It helps on large parts of the code that are not very critical but still interesting; the most critical parts can better be optimised by hand.

In this work, grafting itself (unlike loop unrolling) was never applied explicitly. The automatic compiler optimisations that include grafting are considered to be sufficient.

5.3.4 Restricted pointers

Like most C programs, libavcodec makes heavy use of loads and stores through pointers. According to the requirements of the C language, a compiler cannot make any assumptions about pointers. Consider the following example:

int func( char *a, char *b, char *c )
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[1];
}

In the general case the pointers can refer to the same or overlapping memory locations. For example, c can be equal to a. If so, the second assignment statement can be data dependent on the first one, since c[0] equals a[0]. This means that the operations of the two statements cannot be executed in parallel, i.e., scheduled into the same TriMedia instruction.

To make sure that the pointers are not aliased to each other, the compiler would have to perform an inter-procedural analysis of the whole application. However, if the programmer is sure that the pointers refer to distinct arrays and do not alias, he or she can indicate this to the compiler by declaring the pointers as restricted:

int func( char *restrict a, char *restrict b, char *restrict c )
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[1];
}

Now the compiler knows that the data pointed to does not overlap in memory with any variable in the current context. The operations can be scheduled into the same VLIW instruction, executing in parallel.

It is the responsibility of the programmer to verify that pointers declared as restricted really do not overlap in memory; an incorrect restrict declaration leads to errors in programs.

5.3.5 Function Inlining

Function inlining gives considerable advantages: the call and return operations are eliminated, the code becomes less fragmented (more plain), giving the scheduler more possibilities to do a good job, and the number of decision trees is reduced, since each function invocation adds a decision tree.

However, function inlining should be used carefully when optimising for the TriMedia, for the same reason as loop unrolling and grafting: it increases the number of instructions, affecting the instruction cache and increasing the number of stall cycles there. One should be especially careful when applying both function inlining and grafting, since this can well affect the performance negatively.

There is a difference between hand inlining and automated inlining. The compiler performs automated loop unrolling before function inlining. If there is a function call in a loop, the automated loop unrolling (and grafting) is not done. Hence, the heavily used function unaligned32_be was implemented as a #define instead of an inlined function. That is, in place of:

int unaligned32_be(int x)
{
    return ( ((FUNSHIFT2(unaligned32(x),unaligned32(x)) & 0xFF00FF00) >> 8)
           | ((FUNSHIFT2(unaligned32(x),unaligned32(x)) & 0x00FF00FF) << 8) );
}

the following #define (hand-inlining) is used:

#define unaligned32_be(x) \
    ( ((FUNSHIFT2(unaligned32(x),unaligned32(x)) & 0xFF00FF00) >> 8) | \
      ((FUNSHIFT2(unaligned32(x),unaligned32(x)) & 0x00FF00FF) << 8) )

5.3.6 Branch Elimination

As indicated above, the TriMedia scheduler produces the best code if the program code is “plain”, i.e., it does not contain branches. Sometimes it is possible to eliminate branches using TriMedia custom operations and/or some tricks. For example ([18]), the statement

if ( p->data < v) cnt++;

can be replaced by

cnt += (p->data < v);

This works because the comparison yields exactly 1 for TRUE and 0 for FALSE; on the TriMedia, TRUE is represented by 1 only, not by an arbitrary non-zero value.
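The rewritten form can be tried in ordinary C as well, since ISO C comparison operators likewise yield exactly 0 or 1; count_less below is our illustrative example, not code from libavcodec:

```c
/* Branch-free counting: the comparison yields exactly 0 or 1, so it
 * can be accumulated directly instead of guarding an increment. */
int count_less(const int *data, int n, int v)
{
    int cnt = 0;
    for (int i = 0; i < n; i++)
        cnt += (data[i] < v);   /* no conditional branch in the loop body */
    return cnt;
}
```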

The TriMedia custom operation iabs can be used in, for example, the following case:

name = (name < 0) ? -name : name;

This can be replaced by

name = IABS(name);

There are a number of custom operations that perform clipping of results, a very common case in which multiple branches occur. For example, the custom operation iclipi clips one argument into the range determined by the other argument.

5.3.7 Automatic Compiler Optimisation

The TriMedia compiler supports several levels of automatic optimisation, selected by the compiler command-line option -O<number>. The levels are:

• -O0 – No optimisation.

• -O1 – Local optimisation.

• -O2 – Variables in registers.

• -O3 – Increased global optimisation.

• -O4 – Inter-procedural global optimisation, inlining.

• -O5 – Increased inter-procedural global optimisation, inlining.

The TriMedia Compilation and Simulation system features profile-driven compilation. This is a way of performance tuning using a compile-profile-recompile cycle. During the first run, the compiler and simulator gather information about the code and write statistics to a file. At the next compilation this information is used to apply the right automatic optimisations.

5.4 Automatic Transformation of MMX to TriMedia Code

Parts of libavcodec have been implemented using MMX [16]. Philips is developing a tool to automatically translate MMX instructions into code that can be compiled for the TriMedia, preferably using the TriMedia custom operations. In this section we describe the conversion steps taken to employ this tool.

5.4.1 MMX to TriMedia Tool

Many multimedia processing applications, including libavcodec, have special optimisations for multimedia extensions. One of the most commonly used extensions is the MMX technology [16]. To make porting MMX-optimised software to the TriMedia fast and easy, Philips is developing a tool that automatically translates MMX macros into TriMedia-compilable code.

The MMX to TriMedia tool is a set of C header files defining the MMX macros. A macro converts one MMX instruction into one or more TriMedia instructions and/or custom operation(s), trying to cause the least possible overhead in execution time.

Since in software the MMX code is often integrated into inline assembly code written for the x86 architecture, Philips is developing another tool that converts inline assembly in Intel syntax to macros. This tool produces regular assembly macros as well as MMX macros, so the MMX to TriMedia tool also supports some basic x86 assembly instructions.

5.4.2 Using MMX to TriMedia Tool for Libavcodec

Libavcodec contains, among others, MMX optimisations. Most of the MMX code is in theform of GNU inline assembly. To make use of the MMX to TriMedia tool for libavcodec,the GNU inline assembly syntax first had to be converted to Intel inline assembly andthen to C code with MMX macros. First, the code is compiled on Windows using MSYS

26 CHAPTER 5. OPTIMISATION OF LIBAVCODEC FOR THE TRIMEDIA

and MinGW¹. After that, the Windows executable file is disassembled with the command objdump -d -mi386:intel, which produces assembly code in Intel syntax. Then the functions in libavcodec with GNU inline assembly are replaced with the corresponding disassembled code. When replacing the GNU inline assembly with the disassembled Intel code, care must be taken with jumps and with global and local variables. Absolute addresses should be changed to labels, constructs of the form [ebp-X] must be substituted by the names of the function's arguments, and constructs of the form [ebp+X] should be changed to the names of local (automatic) variables or eliminated. Currently these changes can only be made manually. When the sources with Intel inline assembly are ready, they are processed with the tool, which converts each instruction in the assembly code into a macro. Finally, the code with macros can be compiled using the MMX to TriMedia tool. The conversion steps and tools used are shown in Figure 5.2.

Figure 5.2: GNU inline assembly to macros conversion steps.

Figure 5.3 shows the code transformation steps taken to convert GNU inline assembly to the macro form for the clear_blocks_mmx function in libavcodec. The left part presents the original GNU inline assembly. The middle part is the inline assembly in Intel syntax, which is derived from the GNU inline assembly by taking the first three steps described in Figure 5.2. The fourth step of Figure 5.2 transforms the middle part of Figure 5.3 into its right part.

5.4.3 Results and Conclusions

The experiment did not succeed. At first, only the functions of the IDCT algorithm were compiled using the MMX to TriMedia tool, to check the functionality and to compare the performance gain with that achieved by manual optimisation. The speedup given by the tool was comparable to, and sometimes even larger than, that achieved by manual optimisation, but the output video stream was damaged.

It is likely that the problem appeared while transforming the GNU inline assembly to macros. But since the MMX to TriMedia tool is still in development and has

¹MSYS and MinGW [2] provide a Linux-like environment with GNU tool sets, allowing compilation of some Linux software on Windows.


Figure 5.3: Code transformation steps (from left to right) taken to convert GNU inline assembly to macros: the GNU inline assembly is transformed to Intel inline assembly, then the latter is converted to the macro form.

not been well tested yet, the problem could also be improper functionality of the tool itself. Discovering the reason(s) for the malfunction could take too much time, so further work with the MMX to TriMedia tool was cancelled.

The MMX to TriMedia tool introduces a certain overhead because there is no one-to-one match between MMX instructions and those of the TriMedia. Some MMX instructions have to be expanded into non-optimal sequences of TriMedia instructions. In other words, in theory manually optimised code should perform better than MMX code compiled with the tool. But in the case of libavcodec, the IDCT algorithm implemented in C differs from the one implemented in MMX. Furthermore, the manual optimisations were applied to the unoptimised C code. Hence, it is not correct to compare the performance gain achieved by the conversion tool with that given by the manual optimisation, since the original sources differ.

If the problem in the IDCT code can be fixed without affecting the execution time, the MMX to TriMedia tool offers a good opportunity to improve the execution time of applications that have already been optimised for MMX. In that case, future work for this project should include the integration of the rest of the MMX-optimised code into libavcodec (almost all of it has already been transformed from GNU inline assembly to macros).

Unfortunately, the process of transforming from GNU to Intel inline assembly took much more time and effort than expected. Libavcodec contains very large parts of GNU inline assembly code. For example, the largest file, dsputil.c, contained approximately ten thousand lines of assembly code after preprocessing. The code transformation proved to be very time-consuming and error-prone, and we question whether it is better than spending the same amount of time applying manual optimisations.

In our opinion, the MMX to TriMedia tool is very useful for quick optimisation for the TriMedia if MMX-optimised code in the form of macros or Intel inline assembly is available. If the conversion tool also supported GNU inline assembly, it would solve the problem of GNU to Intel inline assembly conversion.


5.5 Experimental Results

5.5.1 Performance Measurement

The performance is measured in terms of the number of clock cycles the TriMedia needs to perform a task. We focus on the task of decoding the video and audio streams in a file.

To measure the gain obtained by optimising libavcodec for the TriMedia, several video files with different properties are used. First, a video file is processed using the “original” libavcodec, that is, libavcodec as it was right after porting was finished, before applying manual optimisations. Then the same video file is processed with libavcodec as it is after the optimisations. Only manual optimisations are considered: the same automatic compiler optimisations (the -O command line option) are applied to both the unoptimised and the optimised version of libavcodec.

The difference in speed is called the performance gain or execution time decrease here. If processing a video file takes N clock cycles on the unoptimised libavcodec and M cycles on the optimised one, the performance gain (in %) is calculated as follows:

performance gain (%) = (1 − M/N) × 100.

This value defines how much faster the optimised libavcodec processes a file compared to the unoptimised library.
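
As a worked example, the first file of Table 5.2 drops from 5046 to 3077 million cycles, giving (1 − 3077/5046) × 100 ≈ 39%. A minimal helper expressing the formula:

```c
/* Performance gain as defined above: n_before = cycles on the
 * unoptimised library, m_after = cycles on the optimised one. */
static double performance_gain(double n_before, double m_after)
{
    return (1.0 - m_after / n_before) * 100.0;
}
```

For the first file of Table 5.2 (5046 vs. 3077 million cycles) this yields about 39%, matching the table.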

5.5.2 Simulation Tool and Benchmark

To obtain the number of clock cycles needed to perform a task, the task is simulated with the tm2270sim TriMedia simulator. The simulator is given a parameter defining a memory size of 128 MByte and is asked to produce a statistics file. The command line issued is as follows:

tm2270sim -memorysize 128000000 -statfile name.stat ffmpeg <FFmpeg arguments>

The output of the simulator and the statistics file can be used to obtain the number of clock cycles needed to perform the task.

For benchmarking, FFmpeg decodes the video and audio streams in a file and stores them as raw video (yuv420p, resolution 160 × 128, 25.00 fps) and audio (WAVE). The run time includes the overhead of reading/writing the files. The command given to FFmpeg is:

ffmpeg -i filename.avi videoout.yuv audioout.wav

5.5.3 Input Video Files

Ten video files of good quality and high resolution are used for the performance measurements. They are listed in Table 5.1. In the subsequent tables the File ID field is omitted to save space; the order of the files in those tables is the same as in Table 5.1.

5.5. EXPERIMENTAL RESULTS 29

File ID  Video       Audio  Resolution  Frames per second  Time (seconds)
01       mpeg2video  mp2    720x480     30                 10
02       mpeg2video  mp2    720x480     30                 10
03       mpeg2video  mp2    720x480     30                 10
04       mpeg2video  mp2    720x576     25                 10
05       mpeg4       mp3    640x480     25                 20
06       mpeg4       mp3    640x480     25                 10
07       mpeg4       mp3    640x480     25                 10
08       mpeg4       mp3    640x480     25                 10
09       mpeg4       mp3    640x480     25                 10
10       mpeg4       mp3    640x480     25                 10

Table 5.1: List of video files used for performance analysis. All mp2 audio is 44100 Hz, stereo, 64 kb/s; all mp3 audio is 48000 Hz, stereo, 128 kb/s. The MPEG4 files were encoded with DivX.

5.5.4 Results

5.5.4.1 Execution speed gain achieved

Table 5.2 gives the execution time (in clock cycles) needed by libavcodec to decode the video files listed in Table 5.1, together with the processor clock frequency required to perform the task in real time. The table shows the results before and after applying the manual optimisations to libavcodec. The performance gain is visualised in Figure 5.4.

Before optimisations          After optimisations           Performance
Cycles      Frequency         Cycles      Frequency         improvement
(millions)  required (MHz)    (millions)  required (MHz)    (%)
5046        504.60            3077        307.75            39
6419        641.92            3752        375.15            42
6347        634.73            3676        367.57            42
6784        678.45            4007        400.74            41
5587        279.37            3912        195.62            30
2992        299.16            2079        207.87            31
3192        319.25            2220        222.01            30
3498        349.79            2372        237.22            32
2347        234.67            1680        168.02            28
3518        351.77            2425        242.54            31

Average: 34.6

Table 5.2: Performance before and after optimisation

The MPEG2 decoder became approximately 41% faster and the MPEG4 decoder 30% faster. One of the most effective optimisations applied was function inlining. When the inline level was set to the default, the TriMedia compiler ignored the inline keyword in the declarations of many small functions. This was obvious from the profiling results: they included functions that were supposed to be inlined (these functions should not have appeared in the function list after compilation). The inline level was therefore set to the maximum, after which FFmpeg performed 17.2% faster on average.

Figure 5.4: Performance before and after optimisation

Another considerable contribution to the performance improvement (7.5% for MPEG2 and 3% for MPEG4 decoding) came from the set of optimisations applied to the unoptimised C version of the IDCT algorithm. These optimisations mostly involved the use of SIMD TriMedia custom operations to increase the instruction-level parallelism, as well as the use of restricted pointers and other techniques. An example of the optimisations applied to the functions implementing the IDCT algorithm is given in Section 5.3.1 (Page 20).

5.5.4.2 Observations

We noticed a tendency that an optimisation usually gives more performance improvement if it is applied together with other optimisations. In other words, if an optimisation A gives a% and an optimisation B gives b% performance improvement on the unoptimised code, then applying both A and B gives a performance improvement of more than (a + b)%. As an illustration, Table 5.3 presents the results of applying the set of optimisations to the IDCT algorithm alone, and Table 5.4 presents the results of applying them when the maximum level of function inlining is forced. This effect can be explained by the better compiler scheduling achieved when the optimised functions are inlined. Based on this observation, future optimisations can be expected to increase even the effect of the currently applied optimisations.

The MPEG4 decoder is less affected by the optimisations applied to the IDCT algorithm than the MPEG2 decoder: the average execution time decrease after applying the optimisations to the IDCT algorithm is 7.5% for the MPEG2 decoder and 3% for the MPEG4 decoder. This affects the results of all the applied optimisations: the MPEG2 decoder became 41% and the MPEG4 decoder 30.33% faster.


Before optimisations          After optimisations           Performance
Cycles      Frequency         Cycles      Frequency         improvement
(millions)  required (MHz)    (millions)  required (MHz)    (%)
3874        387.36            3614        361.38            6.7
4670        467.04            4326        432.61            7.4
4582        458.17            4270        427.04            6.8
5143        514.29            4766        476.63            7.3
4965        496.54            4832        241.62            2.7
2651        265.06            2584        258.43            2.5
2847        284.73            2772        277.16            2.7
3133        313.32            3043        304.33            2.9
2071        207.13            2024        202.38            2.3
3094        309.39            3008        300.85            2.8

Average: 4.6

Table 5.3: Performance gain of IDCT algorithm’s optimisation without forcing inlining

Before optimisations          After optimisations           Performance
Cycles      Frequency         Cycles      Frequency         improvement
(millions)  required (MHz)    (millions)  required (MHz)    (%)
3316        331.64            3077        307.75            7.2
4067        406.70            3752        375.15            7.8
3961        396.14            3676        367.57            7.2
4352        435.17            4007        400.74            7.9
4033        201.66            3912        195.62            3.0
2139        213.87            2079        207.87            2.8
2289        228.88            2220        222.01            3.0
2454        245.39            2372        237.22            3.3
1723        172.34            1680        168.02            2.5
2503        250.31            2425        242.54            3.1

Average: 4.8

Table 5.4: Performance gain of IDCT algorithm's optimisation applied together with forcing inlining


6 Parallelisation of Libavcodec for Wasabi

The Wasabi chip contains several CPUs. Although the final design has not been fixed, it is anticipated that most processors will be TriMedias. To exploit the computational power of multiple TriMedias, libavcodec must be parallelised to distribute the workload among several processors. In this chapter we describe how libavcodec, optimised for the TriMedia, was parallelised for Wasabi, and present the performance improvement given by the parallelisation as well as some observations about the Wasabi architecture.

6.1 Parallelisation of the Original Libavcodec

In the original libavcodec the video encoding process is parallelised. The parallelisation is realised in the following way. If libavcodec is compiled with threads support, the file pthread.c (with the POSIX Pthreads interface [15]) or w32thread.c (with the Windows threads interface [6]) is compiled and linked to libavcodec. These files contain functions that can be used to parallelise an application, such as functions to create and synchronise threads. For each audio and video stream in every input file, and for both the audio and video streams of every output file, the main thread creates a number of threads (their number is given as N in the -threads N command line argument to ffmpeg). The video encoding exploits data parallelism: each thread executes the same code but on a different piece of data. Before calling the most computationally intensive functions, the main thread splits the data among the given number of threads, prepares the context for each thread, and invokes a function from the multithreading interface which activates the threads, providing them with the function to run and the data to process.
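
The split-and-dispatch pattern described above can be sketched as follows. This is a minimal POSIX threads illustration of the pattern, not libavcodec's actual code; all names and the per-slice kernel are invented:

```c
#include <pthread.h>
#include <stddef.h>

/* Each worker runs the same function on its own slice of the data
 * (data parallelism), mirroring how the libavcodec main thread hands
 * each thread a function to run and a piece of data to process. */
typedef struct {
    int *data;   /* start of this thread's slice */
    size_t len;  /* length of the slice          */
} slice_t;

static void *worker(void *arg)
{
    slice_t *s = (slice_t *)arg;
    for (size_t i = 0; i < s->len; i++)
        s->data[i] *= 2;          /* stand-in for the real kernel */
    return NULL;
}

/* Split n elements over nthreads workers (nthreads <= 16 assumed),
 * start them, and join them, as the main thread does per frame. */
static void parallel_run(int *data, size_t n, int nthreads)
{
    pthread_t tid[16];
    slice_t s[16];
    size_t chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; t++) {
        size_t off = (size_t)t * chunk;
        s[t].data = data + (off < n ? off : n);
        s[t].len = off < n ? (n - off < chunk ? n - off : chunk) : 0;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

In libavcodec the equivalent of worker() would be the motion estimation or encoding kernel, and the slices are rows of macroblocks rather than a flat array.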

Although threads are created both for audio and video processing and both for encoding and decoding, only the video encoding algorithm is actually parallelised in the original libavcodec. It appears that the video decoding and audio processing algorithms are meant to be parallelised in future versions of libavcodec.

Figure 6.1 illustrates how video encoding is parallelised in libavcodec.

6.2 Parallelisation of Libavcodec for Wasabi

As explained in the previous section, the original libavcodec contains:

• pthread.c: an interface using the POSIX Pthreads Application Programming Interface (API) for Linux systems.

• w32thread.c: an interface using the Win32 threads API for Windows systems.



Figure 6.1: Time-line illustrating how the libavcodec video encoder is multithreaded.

However, Wasabi supports the TriMedia Operating System Abstraction Layer (TM OSAL [30]) API. Therefore, a new file tmosal_thread.c containing the TM OSAL parallel interface had to be created. The Win32 threads interface was chosen as a base, because the POSIX Pthreads interface uses mutexes [9] and condition variables [15] to synchronise. Events [30], which could be used instead of condition variables, are not fully supported on Wasabi yet; they could introduce some extra delay or even cause improper functionality. The Windows threads interface employs semaphores [9] to synchronise, and these are efficient and reliable on Wasabi.

The creation of the TM OSAL interface involved replacing library function calls in the Windows interface by equivalent calls for TM OSAL, plus some further changes where an equivalent function could not be found. For example, the Windows function BeginThreadEx, which creates a task, has a good equivalent in tmosalTaskCreate in TM OSAL. However, the function WaitForSingleObject, which is used to postpone the execution of the main thread until another task has finished, does not have a TM OSAL equivalent. An additional semaphore was used to signal that a task has finished its work.
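
The workaround can be sketched with POSIX semaphores standing in for the TM OSAL primitives (the TM OSAL calls themselves are not reproduced here; the names and the task body are illustrative):

```c
#include <pthread.h>
#include <semaphore.h>

/* Emulating WaitForSingleObject with a semaphore: the task posts a
 * "done" semaphore when it finishes, and the main thread waits on it.
 * POSIX calls stand in for the TM OSAL equivalents. */
typedef struct {
    sem_t done;
    int result;
} task_t;

static void *task_body(void *arg)
{
    task_t *t = (task_t *)arg;
    t->result = 42;          /* the task's actual work goes here */
    sem_post(&t->done);      /* signal completion */
    return NULL;
}

static int run_and_wait(task_t *t)
{
    pthread_t tid;
    sem_init(&t->done, 0, 0);
    pthread_create(&tid, NULL, task_body, t);
    sem_wait(&t->done);      /* blocks until the task posts "done" */
    pthread_join(tid, NULL);
    sem_destroy(&t->done);
    return t->result;
}
```

The same handshake works with tmosalSemaphore calls on Wasabi, which is what made semaphores a safe replacement for the missing wait primitive.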

6.3 Parallelisation Results

The experiments below consisted of encoding video and audio from raw format. Video is encoded to MPEG4 and audio to MPEG2 format. The sample is 5 seconds long, the resolution is 720 × 480, and the frame rate is 29.97 frames per second. The bitrate is 2000 kbit/s. The following command line performed the encoding operation:

ffmpeg -threads 8 -s 720x480 -r 29.97 -i in.yuv -i in.wav -b 2000 out.avi

The experiments show that the highest speedup is achieved when the number of threads created by FFmpeg equals the number of CPUs (TriMedias) in Wasabi. Hence, in the results below, the number of CPUs determines the number of threads.

Table 6.1 shows the execution time for different numbers of processors. Figure 6.2 depicts the speedup as a function of the number of CPUs. It can be seen that the speedup is almost linear for up to 6 processors; after that it slightly levels off. But the scalability is not very impressive: on 2 CPUs the speedup is 1.8, on 4 CPUs it is 3.1, and on 8 CPUs it is only 5.1.

6.3. PARALLELISATION RESULTS 35

Number of CPUs  Number of cycles (millions)  Frequency needed for real-time processing
1               21 626                       4.3 GHz
2               12 005                       2.4 GHz
4                6 906                       1.4 GHz
6                4 933                       987 MHz
8                4 203                       841 MHz

Table 6.1: Number of cycles and frequency required to encode video and audio for different numbers of processors.

Figure 6.2: Speedup as a function of the number of CPUs.

To explain this behaviour, Figure 6.3 depicts the thread activity during the encoding of one video frame. The rows represent the tasks (threads); the horizontal bars show when a processor was performing a task, and a bar's colour indicates which CPU was executing the corresponding task. The top row represents the main thread ROOT, the next seven rows, named IDLn (where n is a number), are “idle” threads meaning that the processors were idle, and the following eight rows represent the tasks (threads) created by libavcodec. The figure shows thread activity while encoding a single video frame only, but the behaviour of the entire application is similar, except for the starting point where the main thread performs some extra work (reading the input file etc.).

Figure 6.3 clearly shows that there are two steps in the encoding of a single frame that have been parallelised, i.e., the encoding algorithm is divided into two computationally intensive parts that have been parallelised. The first part is the motion estimation algorithm for P- and B-frames and the algorithm determining the spatial complexity for the frame rate control for I-frames. The second part performs the actual encoding. Between these two parts the application (the main thread) merges the contexts after the first part and performs some other work, including preparing the MPEG4 picture header. This work is not parallelised, which is why there is a gap between


Figure 6.3: Thread activity during the encoding of one video frame on 8 processors (the image is provided by the Wasabi performance analysis tool TimeDoctor).

the two parallelised parts. In this gap only the main thread performs useful work, on a single CPU; the other processors are idle. This is the main reason for the lack of scalability. In addition, the work is slightly unbalanced: some threads finish earlier than others. The second parallelised part performs quite well: some CPUs finish their work earlier than others and stay idle, but not for very long.

Between frames the main thread needs some time (very little compared to that within a frame) to process one or more audio frames and switch to the next video frame.

6.4 Conclusion

In this chapter we described how the video encoding was parallelised for Wasabi. The parallelisation of libavcodec required only a little effort because the video encoding process had already been parallelised in the original library. The parallelisation involved extending libavcodec with a multithreading interface for TM OSAL, created on the basis of the interface for Windows threads that existed in the original library. Unfortunately, the video encoding is not parallelised as well as desired; the speedup obtained by increasing the number of CPUs does not scale well enough: for 8 CPUs it is only 5.1, corresponding to an efficiency of 64%. The reason for the lack of scalability is that the main thread performs some non-parallelised work between the tasks that are parallelised in the video encoding process. To increase the scalability, this non-parallelised work should either be optimised so much that its effect can be neglected, or be parallelised.


Some investigation is needed to determine whether parallelising the video decoding and the audio encoding and decoding makes sense. If the communication and synchronisation overhead introduced by the parallelisation does not outweigh its advantages, the video decoding and audio encoding and decoding can also be parallelised in libavcodec using the same interface as the video encoding.


7 Interfacing Libavcodec Running on TriMedia with Applications on a General Purpose Processor

In this chapter an interface between the TriMedia(s) executing libavcodec and host applications that use libavcodec is proposed. This interface enables applications to efficiently use libavcodec running on the TriMedias, without having to port the applications themselves to the TriMedia.

7.1 Introduction

The Wasabi chip will contain several TriMedia (TM) processors and one or more general purpose processors (GPPs). If the TriMedia processors run libavcodec and the GPP(s), running Linux, run an application using it (as depicted in Figure 7.1), this gives a wide range of possibilities to speed up the system.

Figure 7.1: Mapping of tasks on processors in the Wasabi chip

The idea of this part of the project is to develop a mechanism for communication between applications running on Linux (on the GPPs) and libavcodec on the TriMedia(s).

The main goal of the system is to reach high performance. It must support multitasking, that is, each GPP should be able to run several applications using libavcodec simultaneously. An important issue is to keep the user applications utilising libavcodec unchanged; it is certainly not desirable to require an application to deal with the interface issues. For an application, libavcodec is a library, and it should be usable as such even though it actually runs on different processor(s). That is, an application should not be



aware of the system architecture; it should use libavcodec as it would on a single processor.

Before considering the interface itself, it makes sense to define what exactly should run on a GPP and what on the TriMedia processor(s). There are two libraries under consideration: libavcodec and libavformat. As described above, the former contains the code processing media content, thus performing very computation-intensive DSP tasks; the latter just reads/writes media files. The idea of using DSP processors, the TriMedias, is to gain execution speed. The task performed by libavformat can be done by a GPP at least as fast as by a TriMedia. Moreover, any interface between different processors will introduce a certain overhead. Obviously, it makes sense to leave libavformat on a GPP as it was and to run only libavcodec on the TriMedia processor(s).

7.2 System Structure

Since the interface provided to user applications should not be changed, a library called libavcodec can be created on the GPP running Linux that is in fact a “wrapper” providing the interface to the TriMedia processors, as shown in Figure 7.2.

Figure 7.2: Libraries - “wrappers”

This “wrapper” library must provide a set of libavcodec application programming interface (API) functions. The set of proposed functions is given in Appendix A. When a function is called by an application, the “wrapper” library on the GPP side must pass the function name and its parameters to the “wrapper” on the TriMedia side, which calls that function in the real libavcodec library. The result is passed by the wrapper on the TriMedia side to the one on the GPP side, and the latter returns it to the application.
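
The call flow can be sketched as follows. This is a hypothetical illustration only: the identifiers, the rpc_call() transport stub, and the wrapped function are all invented, and the real transport (the shared function_call structure and interrupts) is discussed in the following sections:

```c
#include <string.h>

/* Hypothetical GPP-side stub for one libavcodec API function. */
struct rpc_request {
    unsigned long function_id;   /* known to both wrapper sides */
    unsigned long args[4];
    unsigned long result;
};

static struct rpc_request last_request;  /* for illustration only */

/* Transport stub: in the real system this would fill the shared
 * function_call structure, raise an interrupt, and wait for the
 * TriMedia side to write the result back. */
static void rpc_call(struct rpc_request *r)
{
    last_request = *r;
    r->result = 0;               /* pretend the remote call succeeded */
}

#define FN_AVCODEC_OPEN 1        /* invented identifier */

/* The stub exported to applications keeps a libavcodec-style
 * signature, hiding the interface completely from the caller. */
static long wrapped_avcodec_open(void *ctx, void *codec)
{
    struct rpc_request r;
    memset(&r, 0, sizeof r);
    r.function_id = FN_AVCODEC_OPEN;
    r.args[0] = (unsigned long)ctx;    /* cast as in Section 7.4 */
    r.args[1] = (unsigned long)codec;
    rpc_call(&r);
    return (long)r.result;
}
```

The application links against the wrapper exactly as it would against the real libavcodec, which is what keeps it unchanged.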

There are a number of questions that will be considered in more detail below:

• How to synchronise the GPP and TriMedia-side “wrappers”?

• How to pass the data from GPP to TriMedia-side “wrapper”?

• On a multitasking system, how to distinguish which process sent a request?


7.3 “Wrapper” Synchronisation

The Wasabi chip will support interrupts, and remote procedure calls can be implemented using them. The GPP-side “wrapper” raises an interrupt when a function call happens; the TriMedia-side “wrapper” does so when a function returns. Before raising an interrupt, a “wrapper” should check that the other side is not currently busy reading data and is not starting its own data transfer; the state field of the function_call structure described in Section 7.4.1.1 keeps the current state, and it must be 0 when a new event happens. The state field should be guarded by a semaphore to ensure that the two sides of the “wrapper” cannot write it simultaneously.

When an interrupt is sent, a data structure is filled in by the sender for the receiver. The data pointed to by the function's arguments follow in two (or more) buffers: data_buffer1 and data_buffer2. The sender writes the data into data_buffer1, then into data_buffer2, etc. When all the buffers are filled, the sender rewrites the buffers in the same order, first checking whether the receiver has already read the buffer it is going to write into.

When one side of the “wrapper” initiates a data transfer, it executes the following algorithm:

1. Acquire the semaphore for the state field of the function_call structure;

2. Set state to 1 (function call);

3. Fill in the function_call structure's function name and argument fields;

4. Raise an interrupt (inform the receiver about the data transfer);

5. Fill in data_buffer1 and data_in_buffer1_size, and set buffer1_state to 0 (has been written by the sender);

6. While there is more data to transfer, repeat the previous step for every data_bufferN;

7. If there is still more data to transfer, go back to step 5; but now, before writing to data_bufferN, check that bufferN_state is 1 (has been read by the receiver);

8. When the last chunk of data has been written, wait until the receiver sets the flag bufferN_state of that chunk to 1 (has been read by the receiver). Set state to 0 (idle) and release its semaphore.

On the receiver side, when an interrupt is received informing it that data are being transferred, the following algorithm is executed:

1. Read the function name and argument information;

2. When buffer1_state is 0 (has been written by the sender), read data_in_buffer1_size bytes from data_buffer1 and set buffer1_state to 1 (has been read by the receiver);

3. Repeat the previous step for every data_bufferN until data_in_bufferN_size is less than data_bufferN's size. When all the buffers have been read and there is still data to read, switch back to the first one.
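
The buffer handshake used by both algorithms can be modelled in a few lines of C. This is an in-process sketch only: the interrupt and semaphore steps are elided, the bufferN fields of the function_call structure are folded into arrays, and sender and receiver are interleaved by hand:

```c
#include <string.h>

/* A minimal in-process model of the double-buffer handshake above.
 * BUFFER_SIZE here is a toy value; the field names follow the
 * function_call structure of Section 7.4.1.1 in array form. */
#define BUFFER_SIZE 8

typedef struct {
    int  buffer_state[2];                 /* 0 = written, 1 = read */
    unsigned long data_in_buffer_size[2];
    char data_buffer[2][BUFFER_SIZE];
} channel_t;

/* Sender: write the next chunk into the next buffer (round-robin),
 * only if the receiver has already released it. */
static int send_chunk(channel_t *c, int chunk_no,
                      const char *src, unsigned long n)
{
    int b = chunk_no % 2;
    if (c->buffer_state[b] != 1)          /* receiver not done yet */
        return -1;
    memcpy(c->data_buffer[b], src, n);
    c->data_in_buffer_size[b] = n;
    c->buffer_state[b] = 0;               /* written by the sender */
    return 0;
}

/* Receiver: read the chunk back and release the buffer. */
static unsigned long recv_chunk(channel_t *c, int chunk_no, char *dst)
{
    int b = chunk_no % 2;
    unsigned long n = c->data_in_buffer_size[b];
    memcpy(dst, c->data_buffer[b], n);
    c->buffer_state[b] = 1;               /* read by the receiver */
    return n;
}
```

Both buffer_state flags start at 1 (free), so the sender can fill both buffers before the receiver has read anything, which is exactly what allows reading and writing to overlap.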


7.4 Data Transfer Between GPP and TriMedia

The GPP(s) and TriMedia(s) have a shared memory, and the data will be transferred through it. The question is how exactly to realise this.

7.4.1 Data to Transfer

The idea is to call libavcodec API functions, so the function itself and its arguments must be transferred. This can be done by creating a structure specifying the function name (or an identifier, just an integer number known to both sides of the “wrapper”, to economise memory) and the parameters. If multitasking is involved, the structure should also contain information about the process that called the function, so that the result can be given back to it.

7.4.1.1 Transferring data including those pointed to by pointers in structures

If there is only a small communication buffer which cannot hold all the data pointed to by the arguments, all the data must be transferred through a FIFO.

The function_call structure may look like the following:

struct function_call {
    /*
     * The following field specifies what is happening at the moment:
     *   0 - idle, available for a new action;
     *   1 - a function call is happening;
     *   2 - a function returns.
     * It should be guarded by a semaphore to ensure that the two
     * wrapper sides cannot write simultaneously.
     */
    unsigned short state;

    /*
     * The maximum length of a function name depends on the list of
     * implemented API functions.
     */
    char function_name[MAX_NAME];

    /*
     * An alternative to the above field is:
     *   unsigned long function_id;
     */

    /* The identifier of the process that called the function: */
    unsigned long process_id;

    unsigned long argument1;
    unsigned long argument1_original_pointer;
    unsigned long argument2;
    unsigned long argument2_original_pointer;
    /*
     * All the arguments follow here. The number of arguments depends
     * on the list of implemented API functions.
     */

    /*
     * The data pointed to by the pointers above follow in chunks
     * in data_buffer1 and data_buffer2. The fields bufferN_state
     * control the process of reading/writing these data; they can
     * have the following values:
     *   0 - data_bufferN has been written by the sender (can be
     *       read by the receiver);
     *   1 - data_bufferN has been read by the receiver (the next
     *       chunk can be written by the sender).
     */
    int buffer1_state;
    unsigned long data_in_buffer1_size;
    char data_buffer1[BUFFER_SIZE];

    int buffer2_state;
    unsigned long data_in_buffer2_size;
    char data_buffer2[BUFFER_SIZE];

    /* Maybe there are more buffers here. */
};

Splitting the data into several buffers is done to speed up the data transfer: while the sender is writing the second buffer, the receiver can read the first one, etc. If there were only one big buffer, the receiver would have to wait until the sender had written it completely before reading it.

The choice of the type unsigned long for the arguments is based on the observation that the parameters of all the functions presented in Appendix A are pointers and other integer numbers; that is, they can all fit into an unsigned integer. As the receiver has the functions' prototypes, it is able to treat the types correctly. So the GPP side of the “wrapper” casts all the parameters to unsigned long, and the TriMedia side casts them back.

If functions with floating point arguments have to be implemented, this should be changed. There are two ways to work around this: either introduce a set of separate fields in the structure for the floating point parameters, or handle them at a low level, writing the floating point numbers into unsigned long fields and treating them correctly. The first solution is better in terms of development speed and the computational effort of the “wrapper”, the second in terms of optimal memory usage.

Among the functions' arguments there are pointers to large structures containing multimedia data. Ideally, these should remain unchanged: the “wrapper” would just pass the pointers to libavcodec on the TriMedia. Unfortunately this is impossible, because the memory areas allocated by an application on Linux are most probably not contiguous. The GPP-side “wrapper” must copy the data to the contiguous buffers data_bufferN of the function_call structure and pass the physical address of the structure to the TriMedia.

The pointers in the argumentN fields of the structure should be updated, i.e., set to addresses inside the data_bufferN fields. To avoid complications in determining which buffer an address falls in, all the buffers may be considered as one infinite buffer. The receiver can easily detect which buffer an address is in. For example, if data_bufferN's size is BUFFER_SIZE and a pointer's value is 3 × BUFFER_SIZE + N when there are 2 buffers (data_buffer1 and data_buffer2), the receiver knows that it should read both data_buffer1 and data_buffer2, then read them again to get the following two chunks of data, and that the pointer points to offset N of data_buffer2 (the 4th chunk).
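
The address arithmetic of this example can be written down directly (a small sketch; the BUFFER_SIZE value, the NUM_BUFFERS constant, and the helper name are illustrative):

```c
/* Map a virtual "infinite buffer" offset onto the chunk number, the
 * physical buffer index, and the offset inside that buffer, as in the
 * example above (2 buffers of BUFFER_SIZE bytes each). */
#define BUFFER_SIZE  4096   /* illustrative value */
#define NUM_BUFFERS  2

typedef struct {
    unsigned long chunk;   /* how many chunks precede this address  */
    unsigned long buffer;  /* 0 -> data_buffer1, 1 -> data_buffer2  */
    unsigned long offset;  /* offset inside that buffer             */
} buf_pos_t;

static buf_pos_t locate(unsigned long virtual_offset)
{
    buf_pos_t p;
    p.chunk  = virtual_offset / BUFFER_SIZE;
    p.buffer = p.chunk % NUM_BUFFERS;   /* buffers alternate         */
    p.offset = virtual_offset % BUFFER_SIZE;
    return p;
}
```

For a pointer value of 3 × BUFFER_SIZE + N this gives chunk 3 (the 4th chunk), buffer index 1 (data_buffer2), and offset N, matching the example above.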

When a function returns, the “wrapper” must know where the output data should be copied to; that is what the argumentN_original_pointer fields are for. If argumentN is a pointer to memory where output data must be written by libavcodec on the TriMedia, argumentN_original_pointer holds the Linux virtual address of the output buffer to which the GPP-side “wrapper” must copy the data from argumentN after the function returns. If argumentN is not a pointer, argumentN_original_pointer is 0.
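The copy-back step after a remote call could look like the following sketch. Only the name argumentN_original_pointer and its “0 means not a pointer” convention come from the text; the struct call_record, NUM_ARGS, and the per-argument length field are assumptions made for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define NUM_ARGS 4  /* placeholder */

/* Per-call bookkeeping kept by the GPP-side "wrapper" (illustrative). */
struct call_record {
    unsigned char *staging[NUM_ARGS];          /* data_bufferN areas holding the
                                                  output data after the call     */
    void *argument_original_pointer[NUM_ARGS]; /* Linux virtual address of the
                                                  caller's output buffer; NULL(0)
                                                  if the argument was no pointer */
    size_t length[NUM_ARGS];                   /* bytes to copy back             */
};

/* After the remote function returned: copy output data back into the
 * application's own (virtual-address-space) buffers. */
static void copy_back(const struct call_record *c)
{
    for (int i = 0; i < NUM_ARGS; i++)
        if (c->argument_original_pointer[i])   /* 0: nothing to copy back */
            memcpy(c->argument_original_pointer[i],
                   c->staging[i], c->length[i]);
}
```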

7.4.1.2 Transferring Only the Pointers

The main drawback of the system proposed above is that the wrappers must be aware of the internal structure of the data passed. It becomes very complex if the structures pointed to by the function’s arguments contain pointers to other structures (which is usual for libavcodec). All the pointed-to structures must be transferred recursively.

In some functions libavcodec can change (assign) some of the pointers inside data structures. In this case the wrapper must be able to detect the following:

• that a function assigned a pointer inside a structure

• where the assigned data are on the GPP (their virtual address). For that the wrapper must keep a table of addresses of TriMedia-side structures and the corresponding structures on the GPP.
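The address table mentioned in the second point could be as simple as the sketch below. All names (addr_map, lookup_gpp, MAX_ENTRIES) are hypothetical; only the idea of mapping TriMedia-side structure addresses to their GPP counterparts comes from the text.

```c
#include <stddef.h>

#define MAX_ENTRIES 64  /* placeholder capacity */

/* Table kept by the wrapper: each entry pairs the TriMedia-side copy of a
 * structure with the corresponding structure in the application's Linux
 * virtual address space. */
struct addr_map {
    unsigned long tm_addr[MAX_ENTRIES];  /* address inside the shared buffer */
    void         *gpp_addr[MAX_ENTRIES]; /* matching Linux virtual address   */
    int           count;
};

/* When libavcodec assigns a pointer inside a structure, look the
 * TriMedia-side address up to find where the data live on the GPP.
 * Returns NULL if the address is unknown. */
static void *lookup_gpp(const struct addr_map *m, unsigned long tm_addr)
{
    for (int i = 0; i < m->count; i++)
        if (m->tm_addr[i] == tm_addr)
            return m->gpp_addr[i];
    return NULL;
}
```

A linear scan is enough for a sketch; a real wrapper handling many nested structures would likely want a hash table instead.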

Some structures contain function pointers that can be invoked from libavcodec, and then the wrappers must be able to organise this as well. Fortunately, all the function pointers encountered so far have default implementations inside libavcodec, and thus it seems that this problem can be avoided by passing NULL instead of them (but then the application must be changed if it uses these function pointers).

The wrappers become very complex. A solution to the problem arising when libavcodec changes data pointed to by pointers

is that both the “wrapper” sides work in the same address space. That is, only pointers


are passed from the GPP to the TriMedia. It also considerably simplifies the “wrappers”. To make it possible, the GPP must be able to allocate a buffer large enough to place all the data in it.

Unfortunately, even in the case of transferring only pointers between the “wrapper” sides, without changing the client application, the GPP-side “wrapper” must be familiar with the internal structures in libavcodec. That is because the application runs in the Linux address space, and while transferring the data to the shared address space and back, the GPP-side “wrapper” must take care of all the nested pointers in the structures.

7.4.2 Memory Allocation

There is one major problem with memory allocation for data transfer: on the GPP side, the “wrapper” operates on a Linux system. Usually applications running on Linux do not have to bother with physical memory, since Linux provides a sophisticated memory management mechanism which allows any application to have its own large linear virtual memory space. However, the “wrapper” on the GPP cannot pass a virtual address to the one on the TriMedia, since it is meaningless for the latter: to resolve a virtual address, it would have to access the Linux page tables. The “wrapper” on the GPP side must provide physical addresses; moreover, if a buffer it allocates is bigger than one page, the buffer must occupy contiguous pages in physical memory, which is usually not the case when the space is allocated in Linux with malloc since the memory is fragmented. Besides that, Linux should not swap out the allocated page(s).

The Linux kernel does provide a programmer with the necessary functions to reach this goal. The further discussion of memory issues is based on the information from [22], [24] and [23]; unfortunately, not all the sources found on the Internet agree on several questions, so these books were chosen as the most reliable ones. The information on writing Linux kernel modules comes from [25] (for the kernel 2.6.*), [26] (for the kernel 2.4.*), [21] and [20]. All the mentioned sources are available on-line.

7.4.2.1 “Wrapper” Structure on GPP Side

The GPP-side “wrapper” needs to work with kernel functions that are available only to kernel modules. The proposed structure is shown in Figure 7.3:

Figure 7.3: GPP-side “wrapper” structure

The Linux kernel module is loaded with the insmod command (see [25] and [26]). It allocates the buffer of the required size with the kmalloc (can allocate up to 128 KByte) or get_free_pages (can allocate a maximum of around 2 MB on most systems with Linux 2.0 and later) function (see [22]). The GPP-side “wrapper” libavcodec library calls the interface function get_wrapper_buffer_addr (it must be checked that the name does not conflict


with other global Linux kernel symbols). The module returns the virtual address of the buffer, which can be used by the library to transfer the data.

As the system runs, the memory becomes more and more fragmented; it is possible that there is no contiguous space in memory of the size the module requests, and Linux cannot solve this problem by swapping/rearranging the allocated pages. In this case the module will fail to allocate memory. To avoid this situation, the memory must be allocated as early as possible after the system boots.

If media processing with libavcodec is one of the major tasks of the system, a different technique for memory allocation is advised. To get a big buffer of physically contiguous memory, we can allocate it by requesting memory at boot time. Allocation at boot time is the only way to retrieve consecutive memory pages without the limits imposed by get_free_pages, both in terms of the maximum allowed size and the limited choice of sizes. To use this technique, the module must be linked directly into the kernel image; that is, the kernel must be rebuilt with it. With kernel 2.3.23 and later, the function alloc_bootmem_pages and others can be used. An approach that is also appropriate for older kernels is reserving the top of RAM by passing the mem= argument to the kernel at boot time, excluding a part of memory from the kernel’s usage. Memory allocated in this way can never be freed until the system reboots (see [22], chapters 7 and 13 for details).
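As an illustration of the mem= technique, a boot loader entry could look like the following. The numbers are examples only: here a hypothetical machine with 512 MB of RAM tells the kernel to use only the first 448 MB, leaving the top 64 MB untouched for the communication buffer.

```
# Illustrative LILO configuration fragment (values are examples):
image = /boot/vmlinuz
  label  = wasabi
  append = "mem=448M"
```

The reserved region above 448 MB is then outside the kernel's memory management entirely, so the driver must obtain access to it itself (e.g. by mapping the physical range), as discussed in [22].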

7.4.2.2 Buffer Size

The size of the buffer allocated for GPP–TriMedia data transfer depends on the purpose the Wasabi chip is used for. If media processing with libavcodec plays a major role in the system, a big part of memory (possibly 50%) should be dedicated to it, and it should be allocated at boot time to make sure that the allocation succeeds. Otherwise, if libavcodec is used on demand and should not consume many resources, a smaller buffer (allocated at run time) has to be used.

The bigger the buffer is, the quicker the system works. If the buffer is big enough for all the data a function needs, the TriMedia-side “wrapper” can pass pointers to those data directly to libavcodec. Otherwise, the data are passed in several steps, introducing an additional delay.

The maximum amount of memory that can be allocated with kmalloc is only 128 KByte; get_free_pages on Linux kernel version 2.0 and before can allocate up to 2^5 pages of memory, the later versions up to 2^9 pages (which on most systems corresponds to about 2 MByte).
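These limits follow directly from the page arithmetic: get_free_pages hands out 2^order whole pages at a time, and with the common 4 KByte page size, order 5 gives 128 KByte and order 9 gives 2 MByte. A one-line sketch (the function name and the fixed 4 KByte page size are assumptions for illustration):

```c
#define PAGE_SIZE_BYTES 4096UL  /* assumed page size on the target */

/* Bytes delivered by an order-n page allocation: 2^order pages. */
static unsigned long order_to_bytes(unsigned int order)
{
    return PAGE_SIZE_BYTES << order;
}
```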

Some experiments were made on a system with about 470 MByte of memory. A test module invoked the get_free_pages(9) function (allocating about 2 MByte) in a loop until it had failed 5 times. Right after booting the system, get_free_pages(9) succeeded around 170–180 times, that is, allocating about 180 × 2 = 360 MByte. After several days of running the system (without heavy extra load), get_free_pages(9) succeeded about 120 times, allocating approximately 120 × 2 = 240 MByte.

Hence, if the communication buffer is allocated at run time, it looks quite safe to assume that the allocation of 2 MByte will succeed. It is also possible to allocate several buffers of 2 MByte to speed up the data transfer at the expense of complicating the


“wrappers”, adding multi-buffer support.

7.4.2.3 Transferring the Communication Buffer Address

Before using the scheme described above to transfer data between the GPP and the TriMedia, the GPP-side “wrapper” must give the TriMedia-side one the address of the communication buffer.

There are different ways to do that, depending on the final Wasabi architecture. If Wasabi has some kind of common registers for all the processing units, the address can be transferred to the TriMedia through them. If not, there must be a memory word with a constant address reserved for this purpose. Another way, if the GPP-side “wrapper” works at boot time, is reserving a memory area from the kernel as described above.

7.5 Conclusion

The interface proposed in this chapter is considerably complex and requires a significant effort to implement. The main reason for the complexity is that the application and libavcodec work in different domains: in Linux and in another operating system. Linux utilises the Memory Management Unit (MMU), which creates the virtual address space and the fragmentation of the memory. An operating system running on the TriMedia cannot work with the virtual memory of Linux; it cannot handle the virtual addresses and the fragmented memory. Because of this, to transfer data from the general-purpose processor (GPP) to the TriMedia, the GPP-side “wrapper” must copy the data passed to the called API function to a contiguous memory buffer. If the communication buffer (the contiguous memory buffer used to transfer data from the GPP to the TriMedia) is not large enough to hold all the data, the data must be transferred in parts through a smaller buffer (or several buffers) in a “First In First Out” (FIFO) manner. This introduces even more data transfer overhead. The overhead of the data transfer can cancel out the advantage of running libavcodec on the TriMedia instead of a GPP. Currently there is no solution to this problem except for running Linux on the TriMedia, so that the applications on a GPP and the TriMedia use the same address space, but this is not possible yet.


8 Conclusions and Future Work

In this chapter the results of this work are summarised, some conclusions are drawn, and some recommendations for future work are given.

8.1 Summary

In this work an investigation has been performed of the optimisation of a large software codec library (libavcodec) for the Philips Wasabi multiprocessor. First the library was ported to the TriMedia. Then some architecture-specific optimisations were applied to it. After that the video encoding process was parallelised for the Wasabi chip multiprocessor, which is currently being developed at Philips Research. Finally an interface between an application running on a general-purpose processor and libavcodec running on the TriMedia DSPs was discussed.

The optimisation for the TriMedia involved some conventional optimisation techniques such as algebraic simplifications and function inlining, and architecture-specific optimisations, including techniques to increase the instruction-level parallelism such as branch elimination and the usage of the custom operations. To measure the performance, a cycle-accurate simulator of the TriMedia has been employed. As benchmarks, several high-resolution MPEG2 and MPEG4 videos have been decoded. The optimisations improve the performance of the MPEG2 and MPEG4 decoders by approximately 41% and 30%, respectively. The most effective optimisations were the manual optimisation of the IDCT algorithm and forcing the compiler to perform function inlining.

The parallelisation for Wasabi involved changing the multithreading API. The video encoding algorithm in the original libavcodec has already been parallelised using the multithreading APIs provided by Windows and POSIX Pthreads. The Windows interface was used as a base for the new interface for the multithreading API TM OSAL used in Wasabi. The experiments consisted of encoding video from raw format to the MPEG4 format. The speedup obtained is almost linear for up to 6 processors, then it slightly levels off. Unfortunately, the encoding algorithm was not completely parallelised; only the two most expensive parts of it were processed by multiple processors. That is the reason why the speedup does not scale very well: with 8 CPUs it is only 5.1, corresponding to an efficiency of 64%.

Finally an interface between an application which runs on a general-purpose processor and libavcodec running on TriMedia processor(s) was proposed. It allows an application to call libavcodec’s library functions without being aware of the system structure, making the interface transparent for the application. The idea behind it is to make it possible to use any application written for Linux on the Wasabi system with (parallelised) libavcodec running on TriMedia CPUs without having to port the application itself to the TriMedia.

8.2 Conclusions and Recommendations

In this section we evaluate the project and give some recommendations.

The process of porting libavcodec to the TriMedia took (too) much time. Many other

open-source applications, which may also be ported to the TriMedia, are also written for the latest GNU compiler and would have the same porting problems. These problems can be solved by extending the GNU compiler with a back-end for the TriMedia. This solution is better than extending the front-end of the TriMedia compiler to support the new features of the GNU compiler, because it requires less effort. The back-end of the GNU compiler has to be extended only once. After that, new extensions of the front-end of the GNU compiler will not require any more work on the back-end for the TriMedia. On the contrary, if the approach of extending the front-end of the TriMedia compiler to support the syntax of the GNU compiler is taken, it has to be updated every time the GNU compiler’s front-end gets new features, which becomes more and more expensive with time. Another approach is to create a “C-to-C” compiler which produces code supported by the TriMedia compiler from code for the GNU compiler. This solution has the same drawback as extending the front-end of the TriMedia compiler: the “C-to-C” compiler must be updated every time the GNU compiler is extended. Besides that, a “C-to-C” compiler will produce more or less unreadable C code for the TriMedia compiler. If Philips wants to incorporate the TriMedia extension into an open-source project, i.e. submit the TriMedia extension to the project maintainers and propose to include it in the project, unreadable code can prevent the project maintainers from accepting it. This conclusion is based on the requirements for code style that are posed by the maintainers of libavcodec for those who want to submit extensions.

Libavcodec supports many optimisations. Like many other multimedia applications (for example the XviD codec), libavcodec supports MMX optimisations. The automatic MMX-to-TriMedia conversion tool described in Section 5.4 tries to exploit the existing MMX optimisations for a fast and easy optimisation for the TriMedia. The results of this work show that this approach is not fast and easy with the current MMX-to-TriMedia tool in case the MMX optimisations are in the form of GNU inline assembly. The reason is that the MMX-to-TriMedia tool expects the instructions in the form of macros which can be obtained with another tool, but the latter accepts only the Intel inline assembly syntax. Manual code transformation from GNU to Intel inline assembly syntax is very time-consuming and error-prone. This problem can be solved by extending the inline-assembly-to-macros conversion tool with GNU inline assembly support. Furthermore, these tools can be incorporated into the TriMedia compiler to make the conversion process transparent to the user.

The parallelisation process performed in this work proved that Wasabi is well prepared for a quick parallelisation of applications that have already been parallelised. With the advantages of hyperthreaded processors and chip multiprocessors, it is expected that more and more open source applications such as libavcodec will support multithreading. The only way to improve this feature of Wasabi even further is to extend its libraries with


support for POSIX Pthreads, because parallelised open-source applications for Linux will most probably utilise POSIX Pthreads.

The interface suggested in this work is considerably complex. The primary reason for the complexity is that the application runs in the Linux domain and libavcodec works in the TriMedia domain, which uses another, less sophisticated operating system. Because of the Memory Management Unit (MMU), which creates the virtual memory address space and the memory fragmentation on Linux, memory issues arise: the interface must copy all data from the Linux virtual address space to contiguous buffers in the physical memory. First, this copying has a large negative effect on the performance. Second, it complicates the interface significantly because of the nested pointers in the data structures that must be copied. The only solution to these problems is to run Linux on the TriMedias.

8.3 Future Work

As future work on the optimisation of libavcodec for Wasabi, further TriMedia-style optimisations are highly recommended. The optimisations targeted at increasing the instruction-level parallelism are expected to significantly improve the performance of the tasks running on the TriMedia processors, because currently most parts of libavcodec do not employ the features offered by the VLIW architecture and a large amount of useless code (NOPs) is executed.

The scalability of libavcodec’s video encoder execution time when the number of CPUs is increased can be improved by parallelising the work that is performed by the main thread while the other tasks are idle. Another approach is to speed up this work using the TriMedia-style optimisations so that its effect is less important. Furthermore, the video and audio decoders and the audio encoder can be parallelised. The potential of this parallelisation should first be investigated to determine whether it makes sense in terms of the introduced synchronisation and communication overhead in the system.

Finally, the interface between the general-purpose processor and the TriMedia can be designed in greater detail and implemented.


Bibliography

[1] FFmpeg-Based Projects, http://ffmpeg.sourceforge.net/projects.php.

[2] MinGW - Home, http://www.mingw.org/.

[3] MPEG Pointers and Resources, http://mpeg.org.

[4] MPlayer homepage, http://www.mplayerhq.hu/.

[5] The FFmpeg project, http://ffmpeg.sourceforge.net/.

[6] Windows threads, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/multiple_threads.asp.

[7] Codec shoot-out 2003 - 2nd installment, (2004), Available at http://www.doom9.org/index.html?/codecs-203-1.htm.

[8] Advanced Micro Devices, Inc., 3DNow! Technology Manual, (2000), Available at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf.

[9] David Culler, J.P. Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Sept 1998.

[10] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, third ed., 2003.

[11] Bill Homer, Restricted Pointers, (1995), WG14/N448, X3J11/95-049. Available at ftp://ftp.dmk.com/DMK/sc22wg14/c9x/aliasing/.

[12] R. Jochemsen, H.264 Decoder Optimization for Wasabi, Philips Research Technical Note 2003/00491 (2003).

[13] X. Li, J.T.J. Eijndhoven, and B.H.H. Juurlink, 3D-TV Rendering on a Multiprocessor System on a Chip, Proc. ProRISC, Nov. 2004.

[14] Ze-Nian Li and Mark S. Drew, Fundamentals of Multimedia, Oct 2003.

[15] Andrae Muys, POSIX Pthreads Tutorial, http://www.cs.nmsu.edu/~jcook/Tools/pthreads/pthreads.html.

[16] Alex Peleg, Sam Wilkie, and Uri Weiser, Intel MMX for Multimedia PCs, Communications of the ACM 40 (1997), no. 1, 24–38.

[17] Philips, TM1100 Preliminary Data Book, March 1999.

[18] Philips, TriMedia Compilation System 4.5 - User Manuals, vol. 4 - CookBook, Sept 2004.



[19] Philips, TriMedia Compilation System 4.5 - User Manuals, vol. 3 - Compilation Tools, Sept 2004.

[20] Alessandro Rubini, Dynamic Kernels: Discovery, (1996), Available at http://www.linuxjournal.com/article/1220.

[21] Alessandro Rubini, Dynamic Kernels: Modularized Device Drivers, (1996), Available at http://www.linuxjournal.com/article/1219.

[22] Alessandro Rubini and Jonathan Corbet, Linux Device Drivers, second ed., June2001, Available at http://www.xml.com/ldd/chapter/book/.

[23] Alessandro Rubini and Georg Zezschwitz, Dissecting Interrupts and Browsing DMA,(1996), Available at http://www.linuxjournal.com/article/1222.

[24] David A. Rusling, The Linux Kernel, 0.8-3 ed., Available at http://en.tldp.org/LDP/tlk/tlk.html.

[25] Peter Jay Salzman, Michael Burian, and Ori Pomerantz, The Linux Kernel Module Programming Guide, 2.6.1 ed., Sept 2005, Available at http://tldp.org/LDP/lkmpg/2.6/html/.

[26] Peter Jay Salzman, Michael Burian, and Ori Pomerantz, The Linux Kernel Module Programming Guide, 2.4.0 ed., Sept 2005, Available at http://tldp.org/LDP/lkmpg/2.4/html/.

[27] Paul Stravers and Jan Hoogerbrugge, Homogeneous Multiprocessing and the Future of Silicon Design Paradigms, Proc. of the International Symposium on VLSI Technology, Systems, and Applications (VLSI-TSA), Apr 2001.

[28] Spacecake team, A quick experiment with a multi-threaded MPEG2 decoder for HD video.

[29] Jos van Eijndhoven, Jan Hoogerbrugge, M.N. Jayram, Paul Stravers, and Andrei Terechko, Cache-coherent heterogeneous multiprocessing as basis for streaming applications, vol. 3, pp. 61–80, 2005.

[30] Arjan van Lankveld, User Documentation – OSAL, Philips Semiconductors RTG/PID/2003/0086 (2003).

[31] John Watkinson, The MPEG Handbook. MPEG-1, MPEG-2, MPEG-4, second ed.,Nov 2004.

A libavcodec and libavformat API

This list of libavcodec and libavformat application programming interface (API) functions is not claimed to be complete. It contains only the main (“high level”) functions, those most likely to be used in an application utilising the libraries.

The list is based on ffmpeg.c, the avcodec_sample.cpp example, and on libavcodec/apiexample.c.

A.1 libavcodec API

A.1.1 void av_register_all(void)

Initialises libavcodec and registers all the codecs and formats. It calls the following functions:

• void avcodec_init(void), which must be called before any other function of libavcodec. It initialises the static tables of the DSP utilities;

• void avcodec_register_all(void), which actually registers the codecs. “Registration” of a codec means pushing a structure with its properties onto a linked list of activated codecs.

Under strict constraints on the execution time and code size, a codec can be registered manually. See the source code of the void avcodec_register_all(void) function for details on how to do this.

A.1.2 AVCodec *avcodec_find_decoder(enum CodecID id)

Among the registered codecs, finds the decoder with the ID id.

A.1.3 AVCodec *avcodec_find_decoder_by_name(const char *name)

Among the registered codecs, finds the decoder with the name name.

A.1.4 AVCodec *avcodec_find_encoder(enum CodecID id)

Among the registered codecs, finds the encoder with the ID id.

A.1.5 AVCodec *avcodec_find_encoder_by_name(const char *name)

Among the registered codecs, finds the encoder with the name name.



A.1.6 int avcodec_open(AVCodecContext *avctx, AVCodec *codec)

Opens the codec. Assigns values to avctx based on codec. Allocates extra memory for codec parameters if necessary. Initialises the codec.

A.1.7 AVFrame *avcodec_alloc_frame(void)

Allocates an AVFrame and sets it to defaults. It can be deallocated by simply calling free().

A.1.8 AVCodecContext *avcodec_alloc_context(void)

Allocates an AVCodecContext and sets it to defaults. It can be deallocated by simply calling free().

A.1.9 int avpicture_get_size(int pix_fmt, int width, int height)

Gets the size in bytes needed for a picture of the type pix_fmt with the dimensions width × height.

A.1.10 int avpicture_fill(AVPicture *picture, uint8_t *ptr, int pix_fmt, int width, int height)

Assigns the appropriate fields of the picture struct to the ptr address. Returns the size (the avpicture_get_size function uses this for exactly that purpose). The other parameters are the same as in avpicture_get_size: the picture type and dimensions.

A.1.11 int img_convert(AVPicture *dst, int dst_pix_fmt, const AVPicture *src, int src_pix_fmt, int src_width, int src_height)

Converts the image with the dimensions src_width × src_height given in src of the type src_pix_fmt into dst (allocated by the caller) of the type dst_pix_fmt.

A.1.12 void av_free(void *ptr)

Does the following:

if(ptr) free(ptr);

A.1.13 int avcodec_close(AVCodecContext *avctx)

Closes the codec given in avctx; that is, frees the memory etc. Always returns 0.


A.1.14 int avcodec_decode_video(AVCodecContext *avctx, AVFrame *picture, int *got_picture_ptr, uint8_t *buf, int buf_size)

Decodes a video frame.
Parameters:
buf – bit stream buffer; must be FF_INPUT_BUFFER_PADDING_SIZE larger than the actual read bytes, because some optimised bit stream readers read 32 or 64 bits at once and could read over the end.
buf_size – the size of the buffer in bytes.
got_picture_ptr – zero if no frame could be decompressed, non-zero otherwise.

Returns −1 on error; otherwise, returns the number of bytes used.

A.1.15 int avcodec_encode_video(AVCodecContext *avctx, uint8_t *buf, int buf_size, const AVFrame *pict)

Encodes pict with the codec specified in avctx into the buffer buf of the size buf_size.

A.1.16 int avcodec_decode_audio(AVCodecContext *avctx, int16_t *samples, int *frame_size_ptr, uint8_t *buf, int buf_size)

Decodes an audio frame.

Returns −1 on error; otherwise, the number of bytes used. If no frame could be decompressed, *frame_size_ptr is zero; otherwise, it is the decompressed frame size in bytes.

A.1.17 int avcodec_encode_audio(AVCodecContext *avctx, uint8_t *buf, int buf_size, const short *samples)

Encodes the audio samples with the codec given in avctx. Writes the result to buf of size buf_size.

Returns the size.

A.2 libavformat API

A.2.1 int av_open_input_file(AVFormatContext **ic_ptr, const char *filename, AVInputFormat *fmt, int buf_size, AVFormatParameters *ap)

Opens a media file as input. Reads the file header and fills in the structure AVFormatContext passed via the first parameter (allocated by the caller).

Parameters:
ic_ptr – the opened media file handle is put here
filename – filename to open
fmt – if non-NULL, force the file format to use
buf_size – optional buffer size (zero if the default is OK)
ap – additional parameters needed when opening the file (NULL if default)


Returns 0 if OK, AVERROR_xxx otherwise.

A.2.2 void av_close_input_file(AVFormatContext *s)

Closes a media file (but not its codecs).
Parameters:
s – media file handle

A.2.3 int av_find_stream_info(AVFormatContext *ic)

Reads the beginning of a media file to get stream information. This is useful for file formats with no headers, such as MPEG. This function also computes the real frame rate in case of the MPEG2 repeat-frame mode.

Parameters:
ic – media file handle

Returns >= 0 if OK, AVERROR_xxx otherwise.

A.2.4 void dump_format(AVFormatContext *ic, int index, const char *url, int is_output)

User interface function printing the information about the input or output format.
Parameters:
ic – a pointer to a structure of the type AVFormatContext.
index – index of the input/output stream.
url – the source (filename).
is_output – defines whether it is an input or output stream.

A.2.5 int av_read_frame(AVFormatContext *s, AVPacket *pkt)

Reads the next frame of a stream. The returned packet is valid until the next call to av_read_frame() or av_close_input_file() and must be freed with av_free_packet.

For video, the packet contains exactly one frame. For audio, it contains an integer number of frames if each frame has a known fixed size (e.g. PCM or ADPCM data). If the audio frames have a variable size (e.g. MPEG audio), then it contains one frame.

pkt->pts, pkt->dts and pkt->duration are always set to correct values in AV_TIME_BASE units (and guessed if the format cannot provide them). pkt->pts can be AV_NOPTS_VALUE if the video format has B frames, so it is better to rely on pkt->dts if you do not decompress the payload.

Returns 0 if OK, < 0 if error or end of file.

A.2.6 int av_write_header(AVFormatContext *s)

Allocates the stream private data and writes the stream header to an output media file.
Parameters:
s – media file handle

Returns 0 if OK, < 0 if error or end of file.


A.3 Several techniques

A.3.1 Find the first video stream

int videoStreamNumber = -1;
for( i=0; i < pFormatCtx->nb_streams; i++ )
    if( pFormatCtx->streams[i]->codec.codec_type == CODEC_TYPE_VIDEO )
    {
        videoStreamNumber = i;
        break;
    }

A.3.2 Get a pointer to the codec context for the video stream

pCodecCtx = &pFormatCtx->streams[videoStreamNumber]->codec;

A.3.3 Find the decoder for the video stream

AVCodec *pCodec = avcodec_find_decoder(pCodecCtx->codec_id);

A.3.4 Find the MP2 encoder

AVCodec *codec = avcodec_find_encoder(CODEC_ID_MP2);

A.4 Get more information

For more information on the functions provided, the source code should be investigated.

A tip: see libavformat/utils.c and libavcodec/utils.c for many useful functions, some

of them even commented.


Curriculum Vitae

Demid Borodin was born in Samarkand, USSR, on the 1st of September 1982.

From 1989 to 1999 he attended school.

In 1999 he started a 4-year bachelor program

in Tashkent State Institute of Aviation, Uzbekistan, at the faculty of Civil Aviation. The specialisation was Exploitation of Aviation and Space Engineering, the division Electrical Equipment of Aviation and Space Technique.

In 2003 he graduated with a bachelor’s degree and started the Master of Science program in Computer Engineering at Delft University of Technology (TU Delft), the Netherlands. In October 2004 he started his master project at Philips Research, Eindhoven. Paul Stravers (IT, Philips Research) and Ben Juurlink (CE, TU Delft) supervised this work.