A Safari Through the MPSoC Run-Time Management Jungle

J Sign Process Syst (2010) 60:251–268, DOI 10.1007/s11265-008-0305-4


Vincent Nollet · Diederik Verkest · Henk Corporaal

Received: 28 May 2008 / Revised: 21 August 2008 / Accepted: 17 October 2008 / Published online: 13 November 2008. © 2008 Springer Science + Business Media, LLC. Manufactured in The United States

Abstract The multiprocessor SoC (MPSoC) revolution is fueled by the need to execute multiple advanced multimedia applications on a single embedded computing platform. At design-time, the applications that will run in parallel and their respective user requirements are unknown. Hence, a run-time manager (RTM) is needed to match all application needs with the available platform resources and services. Creating such a run-time manager requires two decisions. First, one needs to decide what functionality to implement. Second, one has to decide how to implement this functionality in order to meet boundary conditions such as real-time performance. This paper is the first to detail a generic view on MPSoC run-time management functionality and its design space trade-offs. We substantiate the run-time components and the implementation trade-offs with academic state-of-the-art solutions and a brief overview of some industrial multiprocessor run-time management examples. We show a clear trend towards more hardware acceleration, a limited distribution of management functionality over the platform and increasing support for adaptive multimedia applications.

V. Nollet (B) · D. Verkest, IMEC, Kapeldreef 75, 3001 Leuven, Belgium, e-mail: [email protected]

D. Verkest, e-mail: [email protected]

H. Corporaal, Technical University Eindhoven, Eindhoven, The Netherlands, e-mail: [email protected]

In addition, we briefly detail upcoming run-time management research issues.

Keywords Run-time management · Multiprocessor SoC · Taxonomy

1 Introduction

The need for handling more complex applications while also reducing the overall cost of embedded devices fuels the Multi-Processor System-on-Chip (MPSoC) revolution.

These MPSoC platforms integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processors interconnected by a scalable, network-like structure [1] to achieve high performance. As in most mobile embedded systems, there is a need for low-power operation and predictable behavior. Depending on the number of applications and their respective needs, platform resources need to be used in a flexible (programmable) way. Finally, to decrease time-to-market and to support software reuse, these MPSoC platforms should be easily programmable and provide scalable hardware components that enable scalable software solutions. Hence, from a manufacturer's viewpoint, the key requirements with respect to both the MPSoC hardware and software are: high performance, low power consumption, easy programmability, predictability, flexibility and scalability.

From an end-user viewpoint, the actual value of future MPSoC devices does not lie within the integrated circuit itself, but in its capability to provide a number of multimedia services or experiences. Video decoding and encoding, to support e.g. video players, video conferencing and mobile TV, is undoubtedly a major driver application for mobile devices. Set-top boxes with embedded MPSoC platforms will provide services like digital television, Internet access and in-home entertainment. In a more distant future, handheld devices will provide augmented reality services, thus paving the way for media-rich, location-aware and context-aware applications such as attaching audio and video objects to physical objects [2].

Executing these multimedia applications on such MPSoC platforms requires matching the application needs with the offered MPSoC platform services. Due to changing run-time conditions with respect to e.g. user requirements or having multiple simultaneously executing applications competing for platform resources, there is a need for a run-time decision-making entity: the run-time manager. From an abstraction viewpoint, the run-time manager is located between the MPSoC platform hardware layer and the application layer (Fig. 1).

The contribution of this paper is twofold. (1) We are the first to provide a generic description of the typical run-time manager components with their respective roles and responsibilities. These components can be identified in existing industrial and academic run-time managers. (2) To the best of our knowledge, we are the first to describe the MPSoC run-time management implementation space and to motivate the potential design trade-offs. The implementation space and its trade-offs are substantiated by classifying existing industrial and academic run-time management solutions.

Figure 1 On a high level, the run-time manager contains system management components and a run-time library. In turn, the system management component contains a quality manager and a resource manager. [Diagram labels: Application(s); Run-Time Manager, consisting of the System Manager (Quality Manager; Resource Manager with Policies and Mechanisms) and the Run-Time Library; MPSoC Platform Services.]

The rest of this paper is organized as follows. Section 2 details the different components present in a generic run-time manager. The different components are briefly motivated with state-of-the-art examples. Section 3 defines the run-time management design space. This design space is also substantiated with state-of-the-art run-time management practices. Section 4 describes existing academic and industrial multiprocessor run-time managers and classifies them by their location in the design space. Section 5 details a few upcoming research challenges for the run-time management domain. Finally, Section 6 presents the conclusions.

2 Run-Time Management Components

This section details the responsibilities of the run-time manager, i.e. what services the run-time manager should provide to the application and what tasks it should take care of with respect to the platform. We first detail the system management components. Subsequently, we describe the role of the run-time library.

2.1 System Management

As Fig. 1 illustrates, three basic functions are contained within system management: the quality management block, a block containing the resource management policies, and a block with the resource management mechanisms.

2.1.1 Quality Management

Dealing with the MPSoC run-time management requirements and constraints in an efficient way can be achieved (1) by enabling the run-time manager to exploit design-time and run-time application knowledge, and (2) by tuning the run-time manager in order to take into account the platform properties and the offered hardware services.

The quality manager, sometimes referred to as Quality of Service (QoS) manager, is a platform independent component that interacts with the application, the user and the platform-specific resource manager. The goal of the quality manager is to find the sweet spot between the capabilities of the application, i.e. what quality levels the application supports, the requirements of the user, i.e. what quality level provides the most value at a certain moment, and the available platform resources. In order to be generic, the quality manager should contain two specific subfunctions: a Quality of Experience (QoE) manager and an operating point selection manager (Fig. 2).

The QoE manager deals with quality profile management, i.e. what are, for every application $i$, the different supported quality levels $q_{ij}$ and how are they ranked. For example, a video application can make different trade-offs when it comes to framerate, resolution and image quality. Every application quality level $q_{ij}$ is associated with a platform resource vector $\vec{r}_{ij} = \{r^{p}_{ij}, r^{m}_{ij}, r^{c}_{ij}\}$ that describes how much platform processing ($r^{p}_{ij}$), memory ($r^{m}_{ij}$) and communication ($r^{c}_{ij}$) resources are needed in order to support this quality. This resource vector can be determined either by the application mapping tools or by profiling an already mapped application. Every quality level $q_{ij}$ also comes with its own set of implementation details $d_{ij}$. These implementation details are forwarded to the resource manager.

Figure 2 Run-time quality management. (1) The QoE manager receives a set of application quality levels and their associated resource needs from the design-time phase and attributes a value to each quality level according to a session utility function $f_u(\cdots)$. The result is a set of application operating points. (2) The operating point selection manager selects a good operating point according to a system utility function and with respect to the available platform resources. [Diagram data flow: application $i$ supplies the application quality levels $\{(q_{i1}, \vec{r}_{i1}, d_{i1}), \ldots, (q_{in}, \vec{r}_{in}, d_{in})\}$; the QoE manager applies $v = f_u(\cdots)$ using the user application preference, environment information and platform information, producing the application operating points $\{(v_{i1}, \vec{r}_{i1}, d_{i1}), \ldots, (v_{in}, \vec{r}_{in}, d_{in})\}$; the operating point selection manager uses platform resource information to output the selected operating point $(v_{ik}, \vec{r}_{ik}, d_{ik})$.]

The QoE manager also has to determine the value every quality level provides with respect to the user of the application. One could attribute a fixed value to every application quality level. However, this assumes that a certain quality level always provides the same value to the user. In reality, a certain quality level might provide a different user value or experience at different moments in time [3] depending on a range of parameters. Consider, for example, a mobile TV application. Intuitively it is easy to understand that the user quality needs (i.e. framerate, resolution, image size and image quality) will be different for viewing a commercial than for watching a sports event. In addition, the desired user video quality will also be different in case high quality implies that the battery will not last until the end of the movie or that other applications will have to be terminated. Hence, a session utility function¹ $f_u(\cdots)$ needs to be defined [4–6] in order to determine a relative value $v_{ij}$ for every quality level $q_{ij}$ at a certain moment. The user input and other parameters such as expected input data, platform information (like battery status) and time/location can be reflected in $f_u(\cdots)$. The end result is an ordered set of application quality levels $q_{ij}$, each with an associated value $v_{ij}$, resource vector $\vec{r}_{ij}$ and implementation details $d_{ij}$. The tuple $(v_{ij}, \vec{r}_{ij}, d_{ij})$ is further denoted as an application operating point.
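The role of the session utility function can be illustrated with a minimal sketch. The field names (`base_value`, `framerate`, `power`), the weighting formula and the battery threshold are assumptions made purely for illustration; they are not taken from [4–6].

```python
def session_utility(level, context):
    """Hypothetical f_u: value of one quality level in the current context."""
    value = level["base_value"]
    if context["content"] == "sports":
        # Assumption: motion-heavy content rewards a high frame rate.
        value *= 1.0 + 0.5 * level["framerate"] / 30.0
    if context["battery"] < 0.2:
        # Assumption: on low battery, power-hungry levels lose value.
        value /= 1.0 + level["power"]
    return value


def operating_points(levels, context):
    """Return (v_ij, r_ij, d_ij) tuples ordered by decreasing value."""
    points = [
        (session_utility(lv, context), lv["resources"], lv["details"])
        for lv in levels
    ]
    return sorted(points, key=lambda p: p[0], reverse=True)


LEVELS = [  # made-up quality levels of a mobile TV application
    {"base_value": 10, "framerate": 30, "power": 3.0,
     "resources": (6, 4, 3), "details": "high"},
    {"base_value": 6, "framerate": 15, "power": 0.5,
     "resources": (3, 2, 2), "details": "low"},
]
```

With these made-up numbers, the high quality level ranks first for a sports event on a full battery, while the low quality level ranks first for a commercial on a nearly empty battery, which mirrors the observation that the same quality level does not always provide the same value.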

The operating point selection manager deals with selecting the best quality level or operating point for all active user applications given the current platform resource status and non-functional constraints such as available energy. The goal of the operating point selection manager is to maximize the system utility function [4–6]. This means selecting an operating point $(v_{ij}, \vec{r}_{ij}, d_{ij})$ for every active application $i$ such that the total system utility is maximized, while the total amount of required resources does not exceed the available platform resources. In its simplest form, the system utility function maximizes the total application value.
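The simplest form of this selection, maximizing the total application value under the resource constraint, can be sketched as follows. The exhaustive search, the application names and all numbers are assumptions for illustration; the cited work does not prescribe this search strategy.

```python
from itertools import product

# Hypothetical operating points per application:
# (value v_ij, (processing, memory, communication) resource vector r_ij).
OPERATING_POINTS = {
    "video": [(10, (6, 4, 3)), (6, (3, 2, 2)), (2, (1, 1, 1))],
    "audio": [(5, (2, 1, 1)), (3, (1, 1, 1))],
}


def select_operating_points(points, available):
    """Pick one operating point per application so that the summed value is
    maximal and the summed resource vector fits into `available`.
    Returns (None, None) when no combination fits."""
    apps = sorted(points)
    best_value, best_choice = None, None
    for combo in product(*(points[a] for a in apps)):
        used = [sum(r[k] for _, r in combo) for k in range(len(available))]
        if any(u > cap for u, cap in zip(used, available)):
            continue  # this combination exceeds the platform resources
        value = sum(v for v, _ in combo)
        if best_value is None or value > best_value:
            best_value, best_choice = value, dict(zip(apps, combo))
    return best_value, best_choice
```

Enumerating every combination grows exponentially with the number of applications, so a run-time implementation would more plausibly rely on a heuristic; the sketch only makes the objective and the constraint explicit.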

A considerable number of authors use a quality manager on top of a resource manager to manage application and global system quality. Wust et al. [7], Bril et al. [4] and Van Raemdonck et al. [8] all provide such run-time management setups. Pastrnak et al. [9, 10] distinguish a design-time phase to estimate application resource needs and quality settings and, subsequently, a run-time quality selection phase. In addition to a global quality manager, some authors also introduce a local, application-specific quality manager that manages quality within the boundaries negotiated with the global quality manager.

Separating quality management from resource management and creating a set of well-defined (in the future, maybe even standardized) interfaces between them enables reuse and interchangeability of these run-time management components. In addition, the separation supports the platform based design paradigm [11] as it allows a global quality manager to work with a resource manager instantiated inside an application-service providing third party component.

¹ Also denoted as job utility function [4] or application utility function.


2.1.2 Resource Management: Policies

Once an application operating point $(v_{ij}, \vec{r}_{ij}, d_{ij})$ has been selected (Fig. 2), the resource requirements $\vec{r}_{ij}$ and the implementation details $d_{ij}$ containing the application task graph (TG) are known. Some authors [5] also include a fixed task graph to platform resources mapping in the implementation details $d_{ij}$. Such a fixed resource allocation greatly reduces the decision freedom of the resource manager. So, in order to remain flexible, one should rely on a run-time resource manager to map the application task graph (TG) onto the platform architecture graph (AG), in other words, to decide on the allocation of platform resources and to ensure the execution of this decision. This means that application tasks need to be assigned to processing elements, data to memory, and that communication bandwidth needs to be allocated. As Fig. 3 illustrates, the resource management policy can be divided into a platform independent resource management algorithm and platform specific cost functions.

Figure 3 The resource manager maps a task graph onto an architecture graph. The policy for doing so contains a platform independent resource management algorithm and platform specific cost functions. The resource management algorithm determines how the mapping solution space is searched, while the cost functions allow the algorithm to assess the quality of a certain (maybe partial) mapping solution. [Diagram labels: Task Graph (TG) with tasks T1, T2a, T2b, T3 and memories Mem12, Mem23; Architecture Graph (AG) with DSP, MEM, GPP; Resource Manager with Resource Management Algorithm (PE Allocation, Comm. Allocation, Memory Allocation) and Cost functions.]

The different ways the task graph can be mapped onto the architecture graph define the mapping solution space. The resource management algorithm is responsible for the way this solution space is searched. Different types of algorithms can be used in this context depending on e.g. the trade-off of mapping speed versus mapping quality, or depending on the distribution of the mapping algorithm over the platform (see Section 3.1). The decision making process of such an algorithm relies on platform specific cost functions to determine the quality of a certain (maybe partial) mapping solution. Casavant et al. [12] provide a comprehensive taxonomy of those (platform independent) resource management algorithms in the domain of general purpose parallel and distributed systems. This taxonomy is illustrated by Fig. 4.

The first distinction of this taxonomy, static versus dynamic, is made based on when the task assignment is made. In both the static and dynamic domain, schemes can be divided into optimal and sub-optimal mapping solutions. In case all information regarding the problem is known and it is computationally feasible, one can use an algorithm that derives the optimal solution (e.g. searching the full solution space). Within the algorithms that provide sub-optimal solutions, we can identify two categories: approximate and heuristic. The approximate solution uses the same algorithm as the optimal solution. However, instead of searching for an optimal solution, the algorithm is stopped as soon as a satisfactory solution is found. Heuristic solutions use some simple rules of thumb to determine a good (but often non-optimal) solution. All the optimal and sub-optimal/approximate techniques can be classified as based on searching the solution space by enumeration, on graph theory, on mathematical programming, or on queuing theory.
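The difference between optimal, approximate and heuristic schemes can be made concrete with a small sketch. The cost function (the load of the most heavily loaded processing element), the task loads and the greedy rule are assumptions chosen for illustration; they are not part of the taxonomy in [12].

```python
from itertools import product


def makespan(assignment, loads, n_pes):
    """Cost of a mapping: the load of the most heavily loaded PE."""
    per_pe = [0] * n_pes
    for task, pe in enumerate(assignment):
        per_pe[pe] += loads[task]
    return max(per_pe)


def enumerate_mapping(loads, n_pes, good_enough=None):
    """Optimal scheme: enumerate the full solution space.
    Approximate scheme: same enumeration, but stop as soon as a mapping
    with cost <= good_enough has been found."""
    best = None
    for assignment in product(range(n_pes), repeat=len(loads)):
        cost = makespan(assignment, loads, n_pes)
        if best is None or cost < best[0]:
            best = (cost, assignment)
        if good_enough is not None and best[0] <= good_enough:
            break
    return best


def greedy_mapping(loads, n_pes):
    """Heuristic scheme (rule of thumb): take the tasks in order of
    decreasing load and put each one on the currently least loaded PE."""
    per_pe = [0] * n_pes
    assignment = [None] * len(loads)
    for task in sorted(range(len(loads)), key=lambda t: -loads[t]):
        pe = per_pe.index(min(per_pe))
        assignment[task] = pe
        per_pe[pe] += loads[task]
    return max(per_pe), tuple(assignment)
```

For the task loads `[3, 3, 2, 2, 2]` on two processing elements, the full enumeration reaches a cost of 6 while the greedy rule ends at 7, which shows a good but non-optimal heuristic result.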

Besides the hierarchical scheme of algorithms, detailed in Fig. 4, there is a flat classification that contains properties that can appear in any node of the hierarchy. The flat portion of the algorithm taxonomy handles properties like adaptivity, i.e. whether the mapping policy takes the current and/or the previous behavior of the system into account, load distribution, or one-time assignment versus dynamic reassignment, i.e. whether a resource assignment remains fixed or can be revisited during the application execution. Tanenbaum [13] denotes these policies as non-migratory and migratory respectively.

Figure 4 Taxonomy of resource assignment algorithms (policies) for parallel and distributed systems [12]. [Diagram node labels: Global; Static; Dynamic; Optimal; Sub-optimal; Approximate; Heuristic; Enumerative; Graph Theory; Math. Programming; Queuing Theory; Distributed; Non-distributed; Cooperative; Non cooperative; Competitive; Non-competitive.]

So the platform independent part of the resource manager could be e.g. a heuristic, a branch & bound algorithm, a genetic algorithm, a full solution space search algorithm, etc. that uses the platform specific resource cost functions to determine the overall cost or quality of a certain TG to AG mapping. Intuitively, it is easy to understand that the cost of the communication resources, allocated between two processors, will be different in case these processors are interconnected by a simple bus or by a complex NoC. In a NoC, the cost of communication depends on the number of hops between source and destination tile, while this is not applicable in case of a bus. A similar reasoning holds for determining the cost of allocating tasks to different types of processing elements and for different types of memories.
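A minimal sketch of two interchangeable, platform specific communication cost functions is given below. The 2D mesh topology, the Manhattan hop count (as with XY routing) and the bandwidth-times-hops cost are assumptions made for illustration; the text only states that NoC cost depends on the number of hops.

```python
def noc_hops(src, dst):
    """Hop count between two tiles of a 2D mesh, given as (x, y) coordinates.
    Assumes the shortest Manhattan path is used."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def comm_cost_noc(bandwidth, src, dst):
    """NoC cost: the allocated bandwidth occupies every hop on the path."""
    return bandwidth * noc_hops(src, dst)


def comm_cost_bus(bandwidth, src, dst):
    """Bus cost: independent of distance; local communication is free."""
    return 0 if src == dst else bandwidth


def mapping_comm_cost(edges, placement, cost_fn):
    """Total communication cost of a task graph mapping.
    edges: (producer, consumer, bandwidth); placement: task -> tile."""
    return sum(cost_fn(bw, placement[a], placement[b]) for a, b, bw in edges)
```

The platform independent search algorithm only calls `mapping_comm_cost`; swapping `comm_cost_noc` for `comm_cost_bus` retargets the same algorithm to a different interconnect.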

In recent years, a number of authors have considered the task graph to architecture graph mapping and resource allocation problem. Hu et al. [14] use a branch & bound algorithm to walk through the task graph TG to architecture graph AG mapping search space. Hansson et al. [15] use a greedy design-time heuristic to map an application graph of IP blocks onto an empty interconnection network graph. This design-time heuristic combines placement of IP blocks with NoC path selection and timeslot allocation. Recently, Stuijk et al. [16] employed a design-time heuristic to assign and schedule a task graph onto an architecture graph. Both the order in which tasks are selected for assignment and the tile cost for a certain task are determined by a cost function. Several authors perform the mapping in two phases: a design-time phase and a run-time phase. Yang et al. [17, 18] use a genetic algorithm for the design-time detection of the most promising task graph to architecture graph assignment and scheduling options. At run-time, they use a greedy heuristic [18] to select the right options for all active task graphs. Ma et al. [19] extend this approach by introducing a run-time local search heuristic to further optimize the scheduling (not the assignment) produced by the greedy heuristic.

2.1.3 Resource Management: Mechanisms

The resource manager makes the resource allocation decisions. However, for executing these decisions, the resource manager has to rely on the underlying mechanisms. A mechanism describes a set of actions, the order in which they need to be performed and their respective preconditions or trigger events. In order to detect trigger events, a mechanism relies on one or more monitors. The action is performed by one or more actuators. The mechanisms are typically associated with the platform dependent parts of the resource management policies depicted in Fig. 3. The resource management mechanisms closely collaborate with the run-time library in order to perform both basic allocation functions, like instantiating tasks on the processing elements, allocating memory blocks and setting up inter-task communication structures, and more complex functions such as run-time task migration.
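The relation between monitors, preconditions and actuators can be sketched as a small data structure. The class name `Mechanism`, the dictionary-based platform state and the migration example (load threshold, `PE0`/`PE1`, the three actions) are hypothetical and only illustrate the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Mechanism:
    monitor: Callable[[State], bool]          # detects the trigger event
    precondition: Callable[[State], bool]     # must hold before acting
    actuators: List[Callable[[State], None]]  # actions, in execution order

    def step(self, state: State) -> bool:
        """Run the actions in order if the trigger fired and the
        precondition holds; report whether anything was done."""
        if not (self.monitor(state) and self.precondition(state)):
            return False
        for act in self.actuators:
            act(state)
        return True


def make_migration_mechanism(load_threshold: float = 0.9) -> Mechanism:
    """Hypothetical task migration mechanism: triggered by high PE load,
    allowed only when the task has reached a migration point."""
    return Mechanism(
        monitor=lambda s: s["load"] > load_threshold,
        precondition=lambda s: s["at_migration_point"],
        actuators=[
            lambda s: s["log"].append("save context"),
            lambda s: s.update(task_pe="PE1"),
            lambda s: s["log"].append("resume"),
        ],
    )
```

The policy decides that a task should move; the mechanism only encodes when and in which order the individual steps are carried out.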

2.2 Associated Platform Run-Time Library

The Run-Time Library (RTLib), shown in Fig. 1, essentially has two functions. First, it provides primitives to abstract the services provided by the hardware ((1) in Fig. 5a). At design time, these RTLib primitives are used by the designer or the design tools. At run-time, the primitives are called by the application. This means that an RTLib implementation should be available on every processing element. Secondly, it acts as the run-time interface to the system manager ((2) in Fig. 5a). In that sense, it plays an important role in enforcing the decisions of the quality manager and the resource manager. Section 4 details some well-known industrial and academic RTLib examples.

Figure 5 Run-Time Library roles. a Primitives to abstract the hardware service (1) and primitives that provide an interface to the system manager (2). b Both the send() and transfer() RTLib data management primitives rely on the DMA hardware service. [Diagram labels, panel (a): Application i; External Metadata Interface supplying $\{(q_{i1}, \vec{r}_{i1}, d_{i1}), \ldots, (q_{in}, \vec{r}_{in}, d_{in})\}$; System Manager with System Quality Manager, Resource Manager (Policy) and Resource Manager (Mechanism); Run-Time Library with API; MPSoC Platform Hardware Properties and Services. Panel (b): Application i; primitive API; RTLib with send(...) and transfer(...); Platform Services with DMA platform service.]

With respect to the provided primitives, one candistinguish three types:

– Quality management primitives. These primitives link the application or the application-specific quality manager to the system-wide quality manager embedded in the run-time manager. This allows the application to e.g. renegotiate the selected quality level [7, 9, 10].

– Data and communication management primitives. These primitives are closely linked to the programming model. For example, in a programming model where tasks communicate by passing messages, one typically requires send() and receive() primitives. To allocate memory blocks, either for processor-local or shared memory, one requires malloc()- and free()-like primitives. Finally, the RTLib can also provide primitives to manage the memory hierarchy. In order to manage scratch-pad memory, a designer would need a transfer() function to move arrays or parts of arrays through the memory hierarchy [20]. As Fig. 5b illustrates, both the transfer() and the send() RTLib primitives can rely on the platform hardware DMA service to move data to another memory hierarchy layer or to the destination task respectively. The RTLib is responsible for configuring the DMA service. In case of software controlled caches, the RTLib could provide invalidate() and prefetch() primitives [21].

– Task management primitives. These primitives allow a designer to create and destroy tasks (e.g. spawn() and join()) and manage their interaction (e.g. with semaphores, mutexes and barriers). The RTLib can also provide primitives to select a specific PE-local scheduling policy or to explicitly invoke the local scheduler (e.g. yield()). In addition, there are primitives used for task migration or operating point switching (e.g. migrationpoint() or switchpoint()). These primitives actually interface to the resource manager and the quality manager respectively ((2) in Fig. 5a).
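The message passing and task management primitives listed above can be mimicked in software with a minimal sketch. The class `RTLib`, its method signatures and the named channels are hypothetical; host threads and queues merely stand in for processing elements and the platform communication service, and do not reproduce any RTLib discussed in this paper.

```python
import queue
import threading


class RTLib:
    """Toy run-time library offering send/receive and spawn/join."""

    def __init__(self):
        self._channels = {}
        self._lock = threading.Lock()

    def _channel(self, name):
        # Create the named channel on first use.
        with self._lock:
            return self._channels.setdefault(name, queue.Queue())

    def send(self, channel, message):
        self._channel(channel).put(message)

    def receive(self, channel, timeout=5.0):
        # Blocks until a message arrives; raises queue.Empty on timeout.
        return self._channel(channel).get(timeout=timeout)

    def spawn(self, fn, *args):
        task = threading.Thread(target=fn, args=args)
        task.start()
        return task

    def join(self, task):
        task.join()


def producer(rt):
    for i in range(3):
        rt.send("ch", i)
    rt.send("ch", None)  # end-of-stream marker


def consumer(rt, out):
    while True:
        msg = rt.receive("ch")
        if msg is None:
            break
        out.append(msg * msg)


def demo():
    rt, out = RTLib(), []
    tasks = [rt.spawn(producer, rt), rt.spawn(consumer, rt, out)]
    for task in tasks:
        rt.join(task)
    return out
```

The application code in `producer` and `consumer` only sees the primitives, so the same task code could in principle run on top of a hardware-assisted implementation of `send()` and `receive()`.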

The RTLib run-time cost will depend on its implementation: ranging from a software implementation executing on the processor that also executes the application tasks to a separate hardware engine next to the application processor. For example, Paulin et al. [22, 23] use a hardware message passing engine to reduce the inter-task communication cost. In the Eclipse hardware template, Rutten et al. [24] use a hardware interface block, denoted hardware shell, between the compute engines and the communication hardware. This shell is instantiated at every compute engine and acts as a run-time library. The shell provides its computation engines with five generic interface primitives for task management and data management. Section 3.2 discusses the trade-offs between hardware and software in more detail.

3 Run-Time Management Implementation Space

Section 2 only details what functionality a run-time manager should contain. One still has to decide how this functionality should be implemented. This section details the run-time management implementation design space and its trade-offs. These trade-offs have a fundamental impact on how to satisfy the key MPSoC requirements: flexibility, scalability, predictability, high performance and low power operation.


In general, one can identify three run-time management implementation space axes (Fig. 6). The first axis deals with the amount of distribution of the run-time manager. The second axis details the spectrum between a hardware and a software implementation. The third axis covers the trade-off between a generic implementation and a flexible or domain specific one. By choosing the right implementation point, it should be possible to design an efficient run-time manager that meets the application needs and that satisfies the key MPSoC requirements. The rest of this section describes the run-time management implementation space and illustrates the trade-offs that can be made with research and industrial examples.

3.1 Centralized Versus Distributed

As we are dealing with multiprocessor systems, one can distinguish a design space that deals with the amount and type of distribution of run-time management functionality. Figure 7 details the classic multiprocessor run-time management categories [25, 26] based on how the processing elements share run-time management responsibilities.

Figure 6 Run-time manager design space: the run-time manager can be implemented in hardware or software, distributed or centralized, non-adaptive or adaptive (i.e. tuned towards the application). [Diagram axes: Centralized to Distributed; SW to HW; Non Adaptive to Adaptive.]

Figure 7 Design space for multiprocessor systems with respect to distribution of run-time management responsibilities. [Diagram labels: Master-Slave with PE(M) and PE(S); Separate Supervisor; Symmetric; each PE shown with System Management and Run-Time Library.]

– Master–Slave configuration. In a master–slave configuration, there is a single master processor that executes the run-time manager, i.e. that monitors and assigns work to the slave processors. In addition, the master is responsible both for part of the computation and the I/O jobs. The slave processors only execute user application code. This means that a run-time library is instantiated on the master and on every slave processor. The slave processors have to wait while the master is handling their calls to the system manager. The benefits of this type of run-time management implementation are its simplicity and efficiency when the slaves are mainly used for compute intensive jobs. As there is only one processor executing the system manager, synchronization with respect to shared resources can be implemented in a straightforward way. The single master is also the main disadvantage: it risks becoming a single point of failure or a bottleneck that fails to serve the slaves with enough work.

– Separate Supervisor configuration. In this case, every processor executes its own run-time management functionality and has its associated run-time library and data structures. Hence, each processor acts as an independent system. Special structures and mechanisms exist to achieve global system management, i.e. collaboration across the processor boundary. This type of system is scalable, gracefully degrades in case of processor failure, and a single processor cannot become a management bottleneck. Unfortunately, due to duplication of data structures there is a memory penalty. In addition, there is a management overhead in optimally controlling and using the system resources with respect to the user application.

– Symmetric configuration. There is a single run-time manager executed by all processors concurrently. Access to shared data structures needs to be handled by critical sections. Although the symmetric configuration is the most flexible of all configurations, it also has some downsides. First, in contrast to the other configurations, this configuration requires a homogeneous set of processing elements with a shared memory. Second, it is the most difficult one to implement in an efficient way. Similar to the Master–Slave configuration, the execution of the run-time manager can become a bottleneck. This can be solved by allowing multiple processors to have concurrent access to disjoint run-time management components. Hence, the scalability of this system lies between a Master–Slave configuration and the Separate Supervisor configuration.


Furthermore, it is easy to envisage an MPSoC system that uses a mix of configurations. As Section 4 details, the next generation ST Nomadik platforms, for example, will combine a Symmetric configuration with a Master–Slave configuration [23].

The taxonomy presented by Casavant et al. [12] (Fig. 4) also handles the distribution of resource management policies. In the realm of dynamic algorithms one can distinguish distributed and non-distributed algorithms, based on whether the resource assignment is done by a single processor (i.e. the Master–Slave configuration) or is distributed among multiple processors (i.e. Symmetric or Separate Supervisor configuration). Depending on the existence of interaction between the different processors, distributed resource management can be split into cooperative and non-cooperative. Non-cooperative resource management means individual processors make assignment decisions independent of the actions of the other processors. Cooperative resource management still means that processors make their own decisions, but they collaborate to reach a common system-wide goal.
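The distinction between non-cooperative and cooperative distributed resource management can be illustrated with a minimal sketch. The model is an assumption made for illustration: work items arrive at individual processors, the system-wide goal is a balanced load, and in the cooperative case every processor can see the load of its peers.

```python
def non_cooperative(arrivals, n_pes):
    """Each PE decides on its own and keeps whatever work arrives locally,
    regardless of the state of the other PEs."""
    loads = [0] * n_pes
    for pe, work in arrivals:
        loads[pe] += work
    return loads


def cooperative(arrivals, n_pes):
    """The receiving PE still makes the decision, but it uses the shared
    load information and forwards the work to the least loaded PE
    (keeping it locally on a tie) to serve the common balancing goal."""
    loads = [0] * n_pes
    for pe, work in arrivals:
        target = min(range(n_pes), key=lambda p: (loads[p], p != pe))
        loads[target] += work
    return loads
```

For the arrival sequence `[(0, 4), (0, 3), (0, 2), (1, 1)]` on three processors, the non-cooperative variant leaves one processor with a load of 9, whereas the cooperative variant ends with a maximum load of 4.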

3.2 Hardware Versus Software

System management functionality or a run-time library is typically implemented in software (e.g. software scheduler) by building on the low-level hardware services provided by the platform (e.g. timer interrupt service). In recent years, also fueled by the MPSoC revolution, quite a few run-time management functions have been implemented in additional hardware.² Hence, one can identify a hardware versus software design axis that is applicable to both the system management and the run-time library. The main motivation for implementing part of the run-time manager in hardware or a separate accelerator is to avoid the overhead caused by executing the run-time management functionality on the application processor [22, 23, 27, 28].

Furthermore, a (more) dedicated implementation of run-time management functionality should satisfy the MPSoC run-time manager boundary constraints more easily. First, besides being significantly faster than its software counterpart [22, 23, 28–30], a hardware implementation holds the promise of being more energy efficient [31]. Secondly, both the maximum response time and its variance decrease, which improves the real-time behavior of the system [27]. This is partly caused by the fact that there is significantly less cache space needed for the run-time management functionality. In addition, the memory footprint of the run-time manager decreases [32]. Finally, by implementing this functionality in a separate block, the run-time management complexity caused by the heterogeneity of multiple platform processing resources is mitigated [33]. Indeed, one can combine heterogeneous application processing with homogeneous run-time management. As Rutten et al. [24] explain, their hardware shell takes care of all system-level run-time management issues, while the compute engine designers focus on application functionality.

² This often means an additional processor or programmable IP block that provides run-time management services.

Consider two tasks executing on a single application processor (Fig. 8). In case of software scheduling (Fig. 8a), the scheduler is invoked at regular intervals by a platform timer interrupt (actions (1) and (2)) or it is called when an event occurs (e.g. (3), when a semaphore changes or a message arrives). This causes overhead because the scheduler needs to be executed on the application processor even when the newly selected task ends up being the same task (action (2)). When moving this functionality to a platform hardware service (Fig. 8b), the run-time overhead is kept to a minimum. Instead of interrupting the executing task to attend to the management functionality, the decision making is done in parallel. This means that valuable application processor cycles are not wasted while taking management decisions. When a decision is made by the platform hardware service, the processor specific actions, like the task context switch, still need to be performed by the processor itself. In addition, decreasing the management time granularity in case of a software scheduler (i.e. the time between clock interrupts) creates a proportionally increasing overhead. In contrast, a platform hardware service can work at a fine granularity without causing additional application processor overhead.
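The proportional growth of the software scheduling overhead can be checked with a back-of-the-envelope model. The model and all numbers are assumptions made for illustration: every timer tick costs a fixed scheduler execution time on the application processor, context switches happen at a fixed rate in both cases, and event-driven scheduler invocations are ignored.

```python
def sw_overhead(tick_us, sched_cost_us, switches_per_s, switch_cost_us):
    """Fraction of application-processor time lost with a timer-driven
    software scheduler: the scheduler runs on every tick, and context
    switches come on top."""
    ticks_per_s = 1_000_000 / tick_us
    lost_us = ticks_per_s * sched_cost_us + switches_per_s * switch_cost_us
    return lost_us / 1_000_000


def hw_overhead(switches_per_s, switch_cost_us):
    """With the scheduling decision taken in a platform hardware service,
    only the context switches still cost application-processor time."""
    return switches_per_s * switch_cost_us / 1_000_000
```

With a hypothetical 10 µs scheduler, 20 µs context switch and 100 switches per second, the software overhead is about 1.2% at a 1 ms tick and about 10.2% at a 100 µs tick, while the hardware-assisted overhead stays at 0.2% regardless of the tick granularity.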

In general, most of the state of the art [22–24, 27, 28, 34] focuses on implementing run-time library functionality in hardware. This mainly includes making scheduling decisions for the (local) application processor or hardware accelerator, providing support for memory management and handling inter-task communication and synchronization. However, one can also rely on a hardware block to perform task-to-processor assignment in a multiprocessor environment [32, 35]. Finally, hardware support is also used for collecting run-time information. This involves non-intrusive monitoring of what is happening on the platform. In real-time systems, it is important to minimize the intrusiveness of the monitoring, i.e. it should not alter the system behavior. This can be achieved by adding the monitoring functionality in hardware [36].


Figure 8 Hardware or Software? a A hardware timer service periodically invokes a software scheduler, which causes fewer processor cycles to be available for the actual user application. b The scheduler is implemented in additional hardware and only interrupts the application processor when a context switch is needed.

[Figure 8 diagram labels: panels (a) and (b) show the activity of Task 1, Task 2 and the RTM over time, with events marked (1), (2) and (3). Legend: Interrupt Service Routine, Scheduler, Context switch, Semaphore call. Services shown: Platform Timer Service, Platform Hardware Service.]

There is indeed a spectrum with a full hardware implementation on the one hand, such as the Real-Time Unit (RTU) [33], and a software implementation on the other hand. In between, one finds configurable run-time management accelerators or combined HW/SW solutions. These platform run-time management services attempt to find the sweet spot between acceleration and the flexibility to change the policies or to adapt to existing software run-time managers. A downside of pure hardware acceleration is the limited application scalability. For example, the RTU is limited to handling 16 tasks at 8 priority levels, 16 semaphores and 8 external interrupts. Similarly, the Task Control Unit (TCU), described by Theelen et al. [28], is limited to supporting 63 independent tasks and 128 communication resources (i.e. semaphore, mailbox or pipe). However, as Paulin et al. [22] point out, one has to make a trade-off between speed and deterministic execution on the one hand and flexibility/scalability on the other. Furthermore, existing software operating systems can also have limitations on the number of simultaneous user processes or file descriptors. Limitations are acceptable if tuned to the application domain.

3.3 Adaptive Versus Non-Adaptive

A general purpose operating system for the desktop PC or workstation attempts to provide fast response time for interactive applications, high throughput for batch applications and a degree of fairness between applications [37]. It relies on generic application-independent heuristics to achieve this [38].

Dealing with the MPSoC run-time management requirements and constraints in an efficient way can also be achieved (1) by tuning the run-time manager with respect to the application or by exploiting application knowledge that is either gathered at run-time or received from the application design flow, and (2) by tuning the run-time manager so as to take into account the platform properties and the offered hardware services. One can consider two categories of adaptation of the run-time manager to the application: design-time adaptation, i.e. designing the run-time manager in such a way that it suits the needs of the application and platform, and run-time adaptation, i.e. changing the behavior of the run-time manager according to the current application(s) or even deferring some system management responsibility towards the application.

As Fig. 9 illustrates, design-time adaptation of the run-time manager relies on a library of parameterizable and configurable run-time management components [39, 40]. A run-time manager generation tool is responsible for creating the run-time manager with the right functionality. This tool takes as input, for example, a specification of the architecture, the memory and resource allocation map and a high-level (i.e. configurable) description of the application tasks. By analyzing the needs of the application tasks, such as communication and synchronization needs, and by looking at the available platform services, the tool instantiates the needed run-time management functionalities for every processing element.
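The generation step described above can be sketched in a few lines. This is a minimal illustration, not the flow of [39, 40]: the `Task` record, the `generate_rtm` function and the component names are all hypothetical, and the only logic shown is "instantiate a software component on a processing element when a task mapped there needs it and no platform hardware service already provides it".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """High-level task description (hypothetical format)."""
    name: str
    pe: str            # processing element the task is mapped to
    needs: frozenset   # run-time services the task requires


def generate_rtm(tasks, library, hw_services):
    """Return, per processing element, the sorted list of software
    components to instantiate from the component library."""
    config = {}
    for task in tasks:
        selected = config.setdefault(task.pe, set())
        for need in task.needs:
            if need in hw_services.get(task.pe, set()):
                continue  # already offered as a platform hardware service
            if need not in library:
                raise KeyError(f"no library component provides {need!r}")
            selected.add(library[need])
    return {pe: sorted(components) for pe, components in config.items()}


if __name__ == "__main__":
    tasks = [
        Task("decode", "dsp0", frozenset({"mailbox", "semaphore"})),
        Task("ui", "cpu0", frozenset({"semaphore"})),
    ]
    library = {"mailbox": "sw_mailbox", "semaphore": "sw_semaphore"}
    hw_services = {"dsp0": {"semaphore"}}  # assumed hardware semaphore unit
    print(generate_rtm(tasks, library, hw_services))
```

Each processing element ends up with only the components its tasks require, which is the footprint benefit that design-time adaptation aims for.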

[Figure 9 diagram labels: Architecture description; Memory map and allocation table; Application: high-level task descriptions; OS/RTM library (APIs, Comm/System services, Device drivers); OS/RTM generator; Generated OS/RTM; System makefiles; Application: targeted task descriptions.]

Figure 9 High-level flow for automatic generation of application-specific run-time management and automatic application software targeting [39, 40].

[Figure 10 diagram labels, panels (a), (b) and (c): Application(s), System Manager, Resource Manager (Mechanisms).]

Figure 10 Three types of run-time adaptation: a run-time manager support for adaptive applications (type 1), b the application also configures parts of the run-time manager policies (type 2), c the application takes over part of the run-time manager responsibilities (type 3).

Run-time adaptation exploits a closer relationship between the run-time manager and the executing application(s). This has typically been the focus of run-time managers included in adaptive multimedia systems [5, 37, 41, 42] and of adaptive operating systems such as the Exokernel [43, 44]. Three types of negotiation and adaptation between run-time manager and application can be identified (Fig. 10).

The first type (type 1) occurs when the run-time manager supports adaptive applications (Fig. 10a). Adaptive applications can be defined as applications that support multiple modes of operation along one or more resource and/or quality dimensions [4, 5, 37, 45–48]. Each application operating mode has its own resource requirements and offers some degree of value towards the user. This way, the run-time manager is able to select the most appropriate operating mode for each executing application. This approach promises higher user value than a simple application accept/reject policy, but requires communication of application design-time analysis information to the run-time manager as well as a run-time manager capable of handling this information.
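As a minimal sketch of type 1 adaptation, the following code picks one operating mode per application so that the summed user value is maximal within a single resource budget. It assumes one scalar resource dimension and uses exhaustive search, which is only practical for a handful of applications; the mode tables, the `select_modes` name and the numbers are hypothetical and do not reproduce any of the cited quality managers.

```python
from itertools import product


def select_modes(apps, budget):
    """apps maps an application name to a list of operating modes, each a
    (mode_name, resource_cost, user_value) tuple. Returns the mode chosen
    per application, or None if no combination fits the budget."""
    names = list(apps)
    best, best_value = None, float("-inf")
    for combo in product(*(apps[name] for name in names)):
        cost = sum(mode[1] for mode in combo)
        value = sum(mode[2] for mode in combo)
        if cost <= budget and value > best_value:
            best_value = value
            best = {name: mode[0] for name, mode in zip(names, combo)}
    return best


if __name__ == "__main__":
    apps = {
        "video": [("high", 8, 10), ("low", 4, 6)],
        "audio": [("hifi", 3, 5), ("basic", 1, 3)],
    }
    for budget in (9, 7, 4):
        print(budget, select_modes(apps, budget))
```

A plain accept/reject policy would have to refuse one of the two applications once the budget drops below 11 units, whereas mode selection keeps both running at reduced quality until no combination fits.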

The second type (type 2) involves the application configuring parts of the run-time manager policies (Fig. 10b). This way, generic policies can be tuned specifically towards the needs of a certain application. This results in better decision-making and a better usage of platform resources. Mamagkakis et al. [49] describe a technique in which the dynamic memory allocator is configured depending on the requesting application. This requires not only the communication of application design-time analysis information towards the run-time manager (as in the first type), but also a run-time manager with configurable management policies.
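The idea of a per-application configurable policy can be sketched with a toy allocator. This is not the technique of Mamagkakis et al. [49]; it only illustrates the interface shape, assuming a free list of block sizes and two textbook placement policies (first fit and best fit). All names are hypothetical.

```python
def first_fit(free_blocks, size):
    """Index of the first free block that is large enough, else None."""
    for index, block in enumerate(free_blocks):
        if block >= size:
            return index
    return None


def best_fit(free_blocks, size):
    """Index of the smallest free block that is large enough, else None."""
    candidates = [(block, index) for index, block in enumerate(free_blocks)
                  if block >= size]
    return min(candidates)[1] if candidates else None


class ConfigurableAllocator:
    """Run-time manager service whose placement policy is set per application."""
    POLICIES = {"first_fit": first_fit, "best_fit": best_fit}

    def __init__(self, free_blocks):
        self.free_blocks = list(free_blocks)
        self.policy_for = {}

    def configure(self, app, policy_name):
        # Type 2 adaptation: the application selects one of the offered policies.
        self.policy_for[app] = self.POLICIES[policy_name]

    def allocate(self, app, size):
        policy = self.policy_for.get(app, first_fit)  # generic default policy
        index = policy(self.free_blocks, size)
        return None if index is None else self.free_blocks.pop(index)


if __name__ == "__main__":
    allocator = ConfigurableAllocator([64, 16, 32])
    allocator.configure("decoder", "best_fit")
    print(allocator.allocate("decoder", 10))  # uses best fit
    print(allocator.allocate("ui", 10))       # falls back to first fit
```

The run-time manager keeps ownership of the mechanism (the free list), while the application only chooses among policies the manager offers. That distinguishes this type from type 3, where the application supplies the management code itself.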

The third type (type 3) occurs when the application takes over part of the run-time manager responsibilities (Fig. 10c). This means that application-specific management is handled within the application itself. The run-time manager is responsible for allocation or multiplexing of the hardware resources. Noble et al. [42] detail a set of extensions to the NetBSD operating system. In their setup, the application requests a set of platform resources and essentially manages these resources to provide a certain quality level with respect to the user. At the extreme end resides the MIT Exokernel [43, 44]. The Exokernel's sole function is to allocate, deallocate and multiplex the physical platform resources. The application (or application library developer) is responsible for building the necessary hardware abstractions and for managing the allocated resources in an efficient way.
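The division of labour in type 3 adaptation can be sketched as follows. The code is a toy model and not the Exokernel interface: `MinimalKernel` only hands out and reclaims pages and checks ownership, while `AppBufferManager` stands for an application-level library that applies its own replacement policy (here least-recently-used) to the pages it was given. Both class names and the policy choice are assumptions made for illustration.

```python
from collections import OrderedDict


class MinimalKernel:
    """Allocates and deallocates pages; contains no management policy."""

    def __init__(self, n_pages):
        self.owner = [None] * n_pages

    def allocate(self, app):
        for page, owner in enumerate(self.owner):
            if owner is None:
                self.owner[page] = app
                return page
        return None

    def deallocate(self, app, page):
        if self.owner[page] != app:
            raise PermissionError("page is not owned by this application")
        self.owner[page] = None


class AppBufferManager:
    """Application-side library: decides which of its own pages to give up."""

    def __init__(self, kernel, app, max_pages):
        self.kernel, self.app, self.max_pages = kernel, app, max_pages
        self.pages = OrderedDict()  # key -> page, least recently used first

    def get(self, key):
        if key in self.pages:
            self.pages.move_to_end(key)  # mark as most recently used
            return self.pages[key]
        if len(self.pages) >= self.max_pages:
            _, victim = self.pages.popitem(last=False)  # evict the LRU entry
            self.kernel.deallocate(self.app, victim)
        page = self.kernel.allocate(self.app)
        if page is None:
            raise MemoryError("kernel has no free pages")
        self.pages[key] = page
        return page
```

Another application could wrap the same kernel with a completely different policy without any change to the kernel.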

4 Multiprocessor Run-Time Management Examples

To substantiate our MPSoC run-time management design space, this section provides a selection of run-time management examples³ for industrial and academic, large-scale, board-level and SoC multiprocessor platforms. The examples are in no particular order except that the SoC and embedded run-time solutions are presented first, followed by the large-scale, parallel system solutions (starting at the Cosy run-time manager). At the end of this section, Tables 1 and 2 provide a brief summary of their run-time management functionality and their implementation space.

³ The comparison is based on publicly/freely available information.

Table 1 Run-time manager functionality. Multiprocessor run-time manager solutions for embedded platforms and large-scale, parallel systems. The order of the examples matches the order of discussion in the text.

             RTM functionality   Quality manager   Resource manager   Resource management mechanisms/RTLib
Embedded     TI OMAP             –                 √                  √
             Linux               –                 √                  √
             ThreadX             –                 √                  √
             Quadros RTXC        –                 –                  √
             ST MultiFlex        –                 √                  √
             RealFAST RTU        –                 –                  √
             Eclipse             –                 √                  √
             Odyssey             √                 √                  √
Large-scale  Cosy                –                 √                  √
             AsyMOS              –                 –                  √
             K42                 –                 √                  √
             MIT Exokernel       –                 √                  √
             Intel McRT          –                 √                  √

Texas Instruments Cumming et al. [50] detail the run-time management approach used by Texas Instruments (TI) to support its OMAP MPSoC platforms (Fig. 11a). In essence, the run-time manager is a software implementation of a Master–Slave configuration. TI created a small, real-time embedded RTOS, denoted DSP/BIOS, to provide a dedicated run-time library (RTLib) for its DSP processing elements. The DSP/BIOS kernel provides basic communication primitives and task scheduling functionality on top of the DSP hardware, so application developers can build modern multi-threaded applications in an easy way.

TI developed the DSP/BIOS Link software to support platforms such as the OMAP SoC, where a general purpose master RISC processor is combined with one or more slave DSP processing elements on a single die. The DSP/BIOS Link software links a standard, independent operating system executing on the general purpose master to the DSP/BIOS kernel executing on the slave DSP. The purpose of the DSP/BIOS Link is to provide a control and communication API between the general purpose tasks and the DSP/BIOS tasks. In addition, it allows the master to boot the slave DSP(s) and to control which algorithms they execute for a specific application.

As Fig. 11a illustrates, there is a system resource management component, linked to the operating system and executing on the general purpose RISC processor. This master resource manager communicates with its counterpart(s), denoted as RM server,⁴ executing on top of the BIOS kernel. The resource manager is responsible for selecting and allocating the slave DSP, for task creation and for setting up the communication structures, for starting and stopping the tasks and, finally, for deallocating the resources.

⁴ In the microkernel world, only the most basic functionality (BIOS) is incorporated into the RTOS kernel. The run-time management or operating system functionality responsible for higher-level tasks, such as resource management, executes outside the RTOS kernel in components commonly denoted as servers. Hence the name RM server for the resource management functionality.

Enea Systems Similar to TI, Enea Systems presents a Master–Slave configuration solution for heterogeneous multiprocessor systems by combining the feature-rich OSE RTOS executing on the master with a compact DSP kernel, denoted OSEck, for the slaves. A link handler provides message passing inter-processor communication primitives.

Quadros The Quadros RTXC RTOS provides its own solution for multiprocessor platforms (e.g. the OMAP platform). This RTOS can handle MPSoCs as well as a set of DSPs on a board or a loose collection of PEs (either heterogeneous or homogeneous). Their approach is to duplicate the RTOS kernel services on every PE. A link manager provides an easy way for tasks on different PEs to communicate. The link manager relies on a high-level message-passing inter-processor communication service. This communication service abstracts the underlying platform hardware communication service. In that sense, the Quadros solution can be classified as more of a Separate Supervisor approach. However, the provided solution mainly focuses on providing RTLib functionality for the application designer. The designer is responsible for deciding on the resource allocation, so there are no actual run-time resource manager policies present. Quadros allows design-time adaptation towards the needs of the application. The RTXCGen tool allows the designer to easily configure the kernel to fit the processing requirements of the application and to only include the required kernel services. There is no support for run-time adaptation.

ARM MPCore & Linux SMP The ARM MPCore is a homogeneous embedded multiprocessor platform that relies on a general purpose operating system, like Linux, that supports a Symmetric configuration. This way, the operating system actually hides the fact that multiple processing engines are available, which enables an easy speed-up of the applications: either because the application is composed of multiple tasks that execute on different processing elements or because multiple single-task applications no longer have to share a single processing element. Linux includes a scheduler with a load balancing policy [38]. The Native Posix Thread Library and the C library act as RTLib [52].

Table 2 Run-time manager examples implementation space. Multiprocessor run-time manager solutions for embedded platforms and large-scale, parallel systems. The order of the examples matches the order of discussion in the text.

             RTM             Distribution                                       Adaptivity                         HW/SW implement.
Embedded     TI OMAP         Master–Slave                                       None                               SW
             Linux           Symmetric                                          None                               SW
             ThreadX         Symmetric                                          Design-Time                        SW
             Quadros RTXC    Separate Supervisor                                Design-Time                        SW
             ST MultiFlex    Master–Slave and symmetric                         None                               HW RTLib
             RealFAST RTU    Master–Slave                                       None                               HW RTLib
             Eclipse         Master–Slave                                       None                               HW RTLib
             Odyssey         Symmetric                                          Run-Time, type 1                   SW
Large-scale  Cosy            Separate Supervisor                                None                               SW
             AsyMOS          Master–Slave                                       Run-Time, type 3                   SW
             K42             Symmetric with separate supervisor                 Run-Time, type 2                   SW
             MIT Exokernel   Symmetric with separate supervisor properties      Run-Time, type 3                   SW
             Intel McRT      Master–Slave with separate supervisor properties   Design-Time and Run-Time, type 2   SW with HW RTLib support

RTOS vendors, such as Express Logic with their ThreadX RTOS, focus on providing SMP support for the MPCore to increase performance [53]. Compared with Linux, such an RTOS is faster and more deterministic. Indeed, the RTOS does not have a user-kernel space boundary, it has some simple, yet efficient and fast ways for determining the task schedule, and it provides plenty of scheduling opportunity [54]. In addition, it has a smaller memory footprint. While ThreadX also provides some degree of automatic load balancing (i.e. resource management functionality), the RTOS mainly focuses on providing RTLib functionality. ThreadX does not provide any design-time RTOS tuning tool (like the Quadros RTXC), but it does provide the full RTOS source code. This equally enables the designer to only include the required components.

[Figure 11 diagram labels. Panel (a): General Purpose Processor, TMS320 DSP, Resource Manager, RM Server, OS Adapter, Link Driver, OS kernel & drivers, MCU Bridge Kernel, Link Driver, Other drivers, DSP/BIOS Kernel. Panel (b): Core 0, Core 1, Core 2, ..., Core N, Operating System, McRT Stub, McRT.]

Figure 11 Run-time management approach for a today's TI OMAP MPSoCs [50] and b for tomorrow's Intel Tera-scale platforms [51].

NetBSD & Odyssey Odyssey [42] extends the NetBSD operating system and, hence, supports a Symmetric configuration. Odyssey provides a collaborative partnership between the system manager and the application. The system manager is responsible for resource arbitration, i.e. for making resource allocation decisions, for enforcing these decisions, and for notifying the applications about these decisions. Then, every application independently decides on how to best adapt its provided quality given the resource constraints. Hence, the Odyssey extensions enable type 1 application adaptivity: the different application quality levels are first provided to the system manager, which takes them into account when deciding on the resource allocation.

RealFast AB RealFast AB developed a Real-Time Unit (RTU) [33]. The RTU is a commercial hardware IP block that provides RTLib functionality for the (heterogeneous) on-chip processors. Communication with the application and the system management happens through memory-mapped registers and interrupts. The RTU is linked to a general purpose OS and manages the application processors. The Silicon TRON project [29] and the Task Control Unit (TCU) of the MμP architecture [28] provide a similar hardware solution.


Eclipse The Eclipse architecture template (Fig. 12) defines a heterogeneous multiprocessor to be used as a flexible and scalable subsystem for MPSoC platforms [24, 34]. Its target application domain is stream processing (e.g. video processing). Eclipse combines the application flexibility of a general purpose processing element (i.e. the CPU) with the efficiency of application-specific hardware processing, denoted as coprocessors. The hardware shell acts as an interface between processing and communication. As it provides RTLib functionality, it relieves the coprocessor designer of having to worry about system-level issues like synchronization, data transport and scheduling. The entire system is conceived as a Master–Slave configuration: a general purpose processor is responsible for configuring the coprocessors and handling their reported events.

STMicroelectronics MultiFlex For its next generation Nomadik platforms, STMicroelectronics has developed a dedicated approach for designing applications and performing run-time management [22, 23], denoted as MultiFlex. These new Nomadik platforms contain multiple general purpose processing elements executing e.g. Linux, Symbian or WinCE in a Symmetric configuration. In addition, the platforms contain multiple specialized DSPs and ASIPs for handling video, audio and 3D algorithms. These processors act as slave processing elements and receive their tasks from the general purpose processing element cluster. The DSPs and ASIPs rely on hardware schedulers and hardware message passing engines for efficient scheduling and inter-task communication respectively. This approach combines a Symmetric configuration with a Master–Slave configuration. In addition, the platform contains hardware services for providing the most critical run-time library functions. Except for the close relation between the MultiFlex design flow and the MultiFlex run-time management components, there is no adaptivity with respect to the application.

[Figure 12 diagram labels: CPU, Coprocessor, Coprocessor, Shell-SW, Shell-HW, task-level interface, communication interface, Communication network, Memory, generic RTLib support, computation.]

Figure 12 The Eclipse architecture template [24, 34] combines a general purpose processor, denoted as CPU, with application-specific coprocessors. The hardware shell provides RTLib functionality and it represents the interface between computation and communication.

Cosy The Cosy microkernel operating system was designed and optimized for board-level multiprocessor and multicomputer platforms [55, 56]. Cosy has a strong focus on providing the right platform abstractions to ease application development. This includes providing primitives for starting tasks (i.e. components of a parallel application) at run-time and for inter-task communication. Run-time resource assignment can be performed manually by the designer or automatically by Cosy. In this context, Cosy expects an application task graph that can then be mapped by several run-time mapping functions. Cosy implements a Separate Supervisor configuration and interacts with the application in the sense that it takes the task graph into account to allocate platform resources. Cosy is too heavyweight for SoC platforms. Other similar microkernel operating systems are Amoeba [57, 58], Sprite [57, 59] and Mach [58, 60].

AsyMOS The AsyMOS run-time manager [61] assigns specific functionality to specific processors in a symmetric, shared memory multiprocessor system. This means that some processors are assigned to handle the application code, while others perform system management functionality. This solution is located between a Symmetric configuration and a Master–Slave configuration. It simplifies run-time management and reduces the number of interrupts and the cache contention on the application processors, which increases performance and predictability. AsyMOS also allows an application to insert its specific resource management components (i.e. type 3 adaptivity).

K42 K42, an IBM research run-time manager for 64-bit cache-coherent multiprocessor platforms, focuses on high performance, platform scalability and application adaptivity [62]. Although K42 implements a Symmetric configuration on the surface, every run-time manager resource object and associated data structures can be distributed in an efficient way over the multiprocessor in order to exploit the use of local memory and to avoid global data structures, global data locks and global management policies. Just like a Separate Supervisor approach, this approach provides near-linear scalability [62]. The system management can also be adapted to the application needs by allowing the application (designer) to select the right combination of provided system management building blocks (i.e. type 2 adaptivity). K42 provides a scheduling infrastructure that supports real-time behavior, resource time-sharing, gang scheduling, and synchronized locks.


Intel McRT Intel recently published its view on the runtime environment for Tera-scale platforms [51]. Their Many-Core RunTime (McRT) environment supports heterogeneous platforms as they see future platforms containing high performance scalar cores as well as an array of high throughput cores and accelerators. The McRT essentially controls the platform resources in a more distributed and cooperative way, while it is linked to a traditional (Master) Operating System that provides all non-core functionality. The McRT RTLib functionality provides the application designer with parallelism primitives such as threading and synchronization services. In addition, it provides primitives to support fine-grain atomic memory transactions (also denoted as transactional memory support). The fine-grain synchronization and scheduling services are partly implemented in hardware. Finally, the McRT RTLib is able to translate popular APIs, such as the OpenMP API and the PThreads API, into the core McRT API. In order to accommodate the various application requirements, the McRT supports both design-time and run-time adaptivity. The design-time adaptivity is achieved by only including those McRT modules that are required, while run-time adaptivity is achieved by providing configurable scheduling policies that allow it to flexibly adapt to specific application needs (i.e. type 2 adaptivity). The scheduling scalability bottleneck is addressed by using cooperative scheduling.

Tables 1 and 2 provide an overview of the discussed run-time management solutions. Table 1 details their available functionality, while Table 2 describes their implementation solution.

Examples Concluding Remarks It is clear that, with respect to heterogeneous multiprocessor platforms, the contemporary industrial run-time management functionality is mainly focused on providing resource management mechanisms, e.g. support for starting and terminating an application, and RTLib functionality, i.e. on providing the application designers with an abstraction layer on top of the hardware. More academic approaches, like Odyssey and Cosy, provide a resource manager. Quality management still appears to be a research topic. With respect to the implementation space, we see a trend for moderate distribution of resource management functionality, although the amount of distribution really depends on the MPSoC platform and the application domain. Hardware support for run-time management is also on the rise, with RealFAST already providing a commercial solution today. The ST MultiFlex approach also relies on hardware RTLib support to reach high performance combined with low-power and deterministic operation. Adaptivity is still in its infancy: academic embedded solutions like Odyssey provide run-time adaptivity, while Quadros provides a commercial tool to tune their run-time manager, at design-time, to the needs of the application. Except for the ST MultiFlex approach, none of the current commercial solutions has been designed for the emerging MPSoC environment; they are based on extending or linking together existing technology. Finally, the Intel view on run-time management for many-core platforms provides a peek into the future, where distribution of run-time management, configurability and hardware support are predicted to be mainstream.

5 Upcoming Research Challenges

Run-time management research is likely to be a never-ending story as the run-time manager interfaces with, on the one side, the applications and their design-tools and, on the other side, the platform hardware and the services it provides. As a result, there will be a need to provide run-time manager innovation as long as platform hardware continues to evolve and as long as new applications, design-tools and user services keep popping up. In this context, Section 5.1 briefly details the future collaboration of design-time tools and run-time management. Section 5.2 then details the role of run-time management for managing SoC platforms in the ultra-deep sub-micron silicon processing era.

5.1 Design-time Application Mapping and Run-time Management

In the embedded systems domain, there has always been a division of work and a collaboration between design-time tools (or the application mapping flow) and run-time manager. Indeed, the design-tools map the application onto the RTLib APIs, while they also provide additional information to the system manager for managing the application and the user requirements at run-time. Upcoming application design and optimization tools [20, 63] will even be able to generate multiple application operating points as required for the quality manager of Section 2.1.1.

However, quite some work remains to investigate and improve the collaboration between today's and upcoming application design-tools and run-time management components. Indeed, the division of work between design-time tools and run-time manager not only requires a view on the evolution of applications and platforms, but also on the economic context.

First, applications are becoming more complex and, just like complex MPSoC platforms, they will be constructed using third-party components and services in order to prevent an application design productivity gap. Secondly, platforms are also evolving from multi-core to many-core [64], which will have an impact on the role of the run-time system [65]. Finally, economics play an increasingly important role in the context of what can be calculated and decided at design-time and what decisions need to be postponed to run-time. In an environment where time-to-market is essential, where new applications pop up at a fast pace and where applications and services can be downloaded from any source, one has to rely more on run-time management to make things work. This shift from design-time to run-time requires researchers (or research teams) that can perform this cross-layer optimization, i.e. from application design-tools down to the platform services.

5.2 Run-Time Management for Ultra-Deep Sub-Micron Technology

In the sub-45 nm silicon processing technology nodes, process variability and reliability issues will start to play an important role [66]. Variability will cause a difference in behavior between two identical processing elements in a single SoC platform as well as between two identical SoC platforms. Furthermore, these differences will vary over time. Reliability issues result in run-time failure of hardware components and their associated services. Essentially this means that the behavior of the predictable platform hardware services, on which the run-time manager relies, is itself becoming unpredictable.

As these phenomena depend on run-time conditions, one requires a sort of run-time manager to handle these situations. One could consider introducing Knobs and Monitors inside various critical components in order to (1) detect when a component goes out of its designed operating conditions and (2) apply corrective measures if possible. This effectively means that costly worst-case design is not necessary and that run-time phenomena such as temperature drift and aging effects can be managed. In such a case, monitors include e.g. delay monitors, temperature sensors and signal level monitors, while the knobs (i.e. run-time controlled actuators) include e.g. setting the voltage or changing the speed/power ratio in line drivers.
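A single step of such a monitor-and-knob loop might look as follows. The sketch assumes one temperature monitor and one supply-voltage knob with simple threshold control and hysteresis; the function name, thresholds, step size and voltage range are all hypothetical values chosen for illustration and are not taken from [66] or [67].

```python
def control_step(temperature_c, voltage, *, t_max=85.0, t_safe=70.0,
                 v_min=0.8, v_max=1.2, step=0.05):
    """Return the new knob setting (supply voltage) for one monitor reading.

    Above t_max the component is outside its designed operating conditions,
    so the voltage is lowered; below t_safe there is headroom, so it is
    raised again. Between the two thresholds the setting is left unchanged
    (hysteresis) to avoid oscillation.
    """
    if temperature_c > t_max:
        return max(v_min, round(voltage - step, 3))
    if temperature_c < t_safe:
        return min(v_max, round(voltage + step, 3))
    return voltage


if __name__ == "__main__":
    voltage = 1.0
    for reading in (90.0, 88.0, 75.0, 60.0):  # assumed sensor readings
        voltage = control_step(reading, voltage)
        print(f"{reading:5.1f} C -> {voltage:.2f} V")
```

A loop this simple could live in a platform component or in the processor-local run-time library. Persistent violations at the lowest setting would have to be escalated to a higher management level.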

Minor corrective services might be implemented as (reliable) platform components, while others will require action of the processor-local run-time library. There might be a need to also involve the resource manager, e.g. when a certain processing element or on-chip communication link fails. One thing seems certain: the role of the run-time manager will be increasingly important for handling platform variability and reliability issues.

Intel also considers reliability issues and fault tolerance schemes for its future many-core platforms [67]. In their schemes as well, the run-time manager should be able to cope with tiles that are out-of-spec or under-performing, and to reroute communication in case of a faulty on-chip communication link.

6 Conclusion

This paper details the different components of the run-time manager together with the functionality they provide. At a high level, the run-time manager contains a system management component and a run-time library. In turn, the system management component contains a quality manager and a resource manager. The quality manager negotiates quality levels with the applications according to the run-time needs of the user. The resource manager makes the resource allocation decisions according to a certain policy and it orchestrates the execution of these decisions through the associated mechanisms. Through the run-time library, the run-time manager provides hardware abstraction services that are used by the application designer and called by the application at run-time.

This paper is also the first to describe the MPSoC run-time management implementation design space. This design space contains three axes: the first axis deals with the amount of distribution, the second axis depicts the hardware versus software trade-off space and the third axis deals with the amount of run-time management adaptivity towards the application and the platform.

Finally, this paper briefly details some contemporary industrial and academic multiprocessor run-time management solutions and takes a peek into the future. It is clear that, with respect to MPSoC platforms, the industrial run-time management is mainly focused on providing RTLib functionality, i.e. on providing the application designers with a hardware abstraction layer. Although industry currently does not provide any real resource or quality management functionality, the ST MultiFlex effort shows that this will be present for future platforms. Academia has developed experimental multiprocessor operating systems that provide advanced run-time management capabilities and that explore the run-time management design space for better solutions with respect to resource management and adaptivity. However, these solutions often need re-targeting towards embedded platforms.

Still, the (research) trend for MPSoC run-time managers is clear: moderate distribution of the run-time manager over the platform resources, more platform services to support the run-time manager and more configurability towards the application. Hardware run-time management components and distribution of the run-time management functionality are likely to arrive first, as these can be provided by MPSoC platform and run-time management vendors. Adaptivity requires a close collaboration between run-time manager and developer and/or design tools. Such solutions probably require more effort to deploy on an industrial scale.

Run-time management research is likely to be a never-ending story. As long as MPSoC platforms and applications continue to evolve, the run-time manager functionality and implementation will have to be adapted. This includes updating the relationship and the division of work between design-time tools and run-time management components. As technology continues to scale, the platform services themselves will be subject to predictability issues. In this case, the run-time manager could also take up the responsibility of monitoring and managing platform hardware reliability.

References

1. Benini, L., & De Micheli, G. (2002). Networks on chips: Anew SoC paradigm. IEEE Computer, 35(1), 70–78.

2. Guven, S., & Feiner, S. (2003). Authoring 3D hypermedia forwearable augmented and virtual reality. In ISWC ’03: Pro-ceedings of the 7th IEEE international symposium on wear-able computers (p. 118). Washington, DC: IEEE ComputerSociety.

3. Geilen, M., Basten, T., Theelen, B., & Otten, R. (2005).An algebra of pareto points. Technical Report ESR-2005-2,Eindhoven University of Technology. January 2005.

4. Bril, R. J., Hentschel, C., Steffens, E. F. M., Gabrani, M.,van Loo, G., & Gelissen, J. H. A. (2001). Multimedia QoS inconsumer terminals. In IEEE workshop on signal processingsystems (pp. 332–343). Antwerp, Belgium. September 2001.

5. Khan, S., Li, K. F., & Manning, E. (1997). The utility modelfor adaptive multimedia systems. In Proceedings of the inter-national workshop on multimedia modeling (pp. 111–126).

6. Lee, C., Lehoczky, J., Siewiorek, D., Rajkumar, R., &Hansen, J. (1999). A scalable solution to the multi-resourceQoS problem. In RTSS ’99: Proceedings of the 20th IEEEreal-time systems symposium (p. 315). Washington, DC:IEEE Computer Society.

7. Wust, C. C., Bril, R. J., Hentschel, C., Steffens, L., &Verhaegh, W. F. J. (2004). QoS control challenges for multi-media consumer terminals. In Proceedings of the internationalworkshop on probabilistic analysis techniques for real timeand embedded systems (PARTES). September 2004.

8. Van Raemdonck, W., Lafruit, G., Steffens, E. F. M.,Otero Perez, C. M., & Bril, R. J. (2002). Scalable 3D graphicsprocessing in consumer terminals. In Multimedia and expo,2002. ICME ’02. Proceedings. 2002 IEEE international con-ference on (Vol. 1, pp. 369–372, 26–29). August 2002.

9. Pastrnak, M., Poplavko, P., de With, P. H. N., & vanMeerbergen, J. (2005). Hierarchical QoS concept for multi-

processor system-on-chip. In Proccedings of the workshop onresource management for media processing in networked em-bedded systems (pp. 139–142). Eindhoven, The Netherlands.

10. Pastrnak, M., de With, P. H. N., & van Meerbergen, J.(2006). Realization of QoS management using negotiationalgorithms for multiprocessor NoC. In IEEE internationalsymposium on circuits and systems (ISCAS) (pp. 1912–1915).Kos, Greece. May 2006.

11. International Technology Roadmap for Semiconductors(ITRS) (2005). 2005 edition: Design Chapter. http://public.itrs.net/.

12. Kuhl, J. G., & Casavant, T. L. (1988). A taxonomy of schedul-ing in general-purpose distributed computing systems. IEEETransactions on Software Engineering, 14(11), 1578–1588.

13. Tanenbaum, A. S. (1995). Distributed operating systems.Englewood Cliffs: Prentice Hall.

14. Hu, J., & Marculescu, R. (2003). Energy-Aware mapping fortile-based NoC architectures under performance constraints.In Proceedings of the Asia & South Pacific design automationconference (ASP-DAC). January 2003.

15. Hansson, A., Goossens, K., & Radulescu, A. (2005). Aunified approach to constrained mapping and routing onnetwork-on-chip architectures. In CODES+ISSS ’05: Pro-ceedings of the 3rd IEEE/ACM/IFIP international conferenceon hardware/software codesign and system synthesis (pp. 75–80). New York: ACM.

16. Stuijk, S., Basten, T., Geilen, M., & Corporaal, H.(2007). Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proc. of 44thdesign automation conference. June 2007 (pp. 777–782).New York: ACM.

17. Yang, P., Wong, C., Marchal, P., Catthoor, F., Desmet, D.,Verkest, D., et al. (2001). Energy-aware runtime schedulingfor embedded-multiprocessor SOCs. IEEE Design and Testof Computers, 18(5), 46–58.

18. Yang, P., & Catthoor, F. (2003). Pareto-optimization-based run-time task scheduling for embedded systems. InCODES+ISSS ’03: Proceedings of the 1st IEEE/ACM/IFIPinternational conference on hardware/software codesign andsystem synthesis (pp. 120–125). New York: ACM.

19. Ma, Z., Scarpazza, D., & Catthoor, F. (2007). Run-time taskoverlapping on multiprocessor platforms. In Proceedings ofthe 3rd workshop on embedded systems for real-time multi-media (ESTIMedia), (pp. 47–52). October 2007.

20. Brockmeyer, E., Miranda, M., Corporaal, H., & Catthoor, F.(2003). Layer assignment techniques for low energy in multi-layered memory organisations. In Proceedings of the de-sign, automation and test in Europe conference (DATE)(pp. 1070–1075).

21. Zucker, D. F., Lee, R. B., & Flynn, M. J. (1998). An auto-mated method for software controlled cache prefetching. InHICSS ’98: Proceedings of the thirty-first annual Hawaii in-ternational conference on system sciences (Vol. 7, p. 106).Washington, DC: IEEE Computer Society.

22. Paulin, P. G., Pilkington, C., Langevin, M., Bensoudane, E.,& Nicolescu, G. (2004). Parallel programming models for amulti-processor SoC platform applied to high-speed trafficmanagement. In CODES+ISSS ’04: Proceedings of the 2ndIEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis (pp. 48–53).New York: ACM.

23. Paulin, P. (2005). SoC platforms of the future: challengesand solutions. In MPSoC’05, July 2005.

24. Rutten, M. J., van Eijndhoven, J. T. J., & Pol, E.-J. D. (2002).Design of multi-tasking coprocessor control for Eclipse.In Proceedings of the tenth international symposium on

http://public.itrs.net/

http://public.itrs.net/


hardware/software codesign (pp. 139–144). New York:ACM.

25. Singhal, M., & Shivaratri, N. G. (1994). Advanced conceptsin operating systems: Distributed, database and multiprocessoroperating systems. London: McGraw Hill.

26. Garcia, J., Ferreira, P., & Guedes, P. (2000). Parallel operat-ing systems. In J. Bazewicz, D. Trystram, & D. Plateau (Eds.),Handbook on parallel and distributed processing. New York:Springer.

27. Jacob, B., Kohout, P., Ganesh, B. (2003). Hardware sup-port for real-time operating systems. In Proceedings of the1st IEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis (pp. 45–51). NewYork: ACM.

28. Theelen, B. D., & Verschueren, A. C. (2002). Architecturedesign of a scalable single-chip multi-processor. In DSD ’02:Proceedings of the euromicro symposium on digital systemsdesign (p. 132). Washington, DC: IEEE Computer Society.

29. Nakano, T., Utama, A., Itabashi, M., Shiomi, A., & Imai, M.(1995). Hardware implementation of a real-time operatingsystem. In Proceeding of the 12th TRON project internationalsymposium (p. 34).

30. Samuelsson, T., Åkerholm, M., Nygren, P., Stärner, J., &Lindh, L. (2003). A comparison of multiprocessor real-timeoperating systems implemented in hardware and software.In International workshop on advanced real-time operatingsystem services (ARTOSS). Porto, Portugal. July 2003.

31. Haukilahti, R. (2002). Energy characterization of a RTOShardware accelerator for SoCs. In Proceeding of the Swedishsystem-on-chip conference.

32. Isaacson, S., & Wilde, D. (2004). The task-resource matrix:Control for a distributed reconfigurable multi-processor hard-ware RTOS. In Proceedings of the international conference onengineering of reconfigurable systems and algorithms (ERSA)(pp. 130–136). Las Vegas, Nevada, USA. 21–24 June 2004.

33. Klevin, T. (2003). Get RealFast RTOS with Xilinx FP-GAs. Xcell Journal, 45. http://www.xilinx.com/publications/xcellonline/xcell_45/xc_pdf/xc_rtos45.pdf.

34. Rutten, M. J., Pol, E.-J., van Eijndhoven, J., Walters, K.,& Essink, G. (2005). Dynamic reconfiguration of streaminggraphs on a heterogeneous multiprocessor architecture. InS. Sudharsanan, V. M. J. Bove, & S. Panchanathan (Eds.),Proceedings of the SPIE (Embedded processors for multime-dia and communications II). (Vol. 5683, pp. 53–63).

35. Nacul, A., Regazzoni, F., & Lajolo, M. (2007). Hardwarescheduling support in SMP architectures. In Proceedings ofthe design automation and test in Europe conference (DATE).Nice, France.

36. El Shobaki, M. (2002). On-chip monitoring of single- andmultiprocessor hardware real-time operating systems. InProceedings of the 8th international conference on real-timecomputing systems and applications (RTCSA). March 2002.

37. Regehr, J., Jones, M., & Stankovic, J. (2000). Operatingsystem support for multimedia: The programming modelmatters. Technical Report MSR-TR-2000-89, MicrosoftResearch. September 2000.

38. Bovet, D., & Cesati, M. (2001). Understanding the LinuxKernel. Sebastopol: O’Reilly.

39. Gauthier, L., Yoo, S., & Jerraya, A. (2001). Automatic gen-eration and targeting of application specific operating sys-tems and embedded systems software. In Proceedings of theconference on design automation and test in Europe (DATE)(pp. 679–685).

40. Gauthier, L., Sungjoo, Y., & Jerraya., A. A. (2001). Auto-matic generation and targeting of application-specific oper-ating systems and embedded systems software. In IEEE

transactions on computer-aided design of integrated circuitsand systems (vol. 20, pp. 1293–1301).

41. Pu, C., & Walpole, J. (1994). A case for adaptive os kernels.In Proceedings of the ACM object oriented programming sys-tems, languages and applications.

42. Noble, B. D., Satyanarayanan, M., Narayanan, D., Tilton, J. E.,Flinn, J., & Walker, K. R. (1997). Agile application-awareadaptation for mobility. In Sixteen ACM symposium onoperating systems principles (pp. 276–287). Saint Malo,France.

43. Engler, D. R., Kaashoek, M. F., & O’Toole Jr., J. W. (1995).The operating system kernel as a secure programmable ma-chine. Operating Systems Review, 29(1), 78–82. January 1995.

44. Kaashoek, M. F., Engler, D. R., Ganger, G. R., Briceño,H. M., Hunt, R., Mazières, D., et al. (1997). Application per-formance and flexibility on exokernel systems. In Proceedingsof the 16th ACM symposium on operating systems principles(SOSP ’97) (pp. 52–65). Saint-Malô, France. October 1997.

45. Nahrstedt, K., & Jin, J. (2002). Classification and comparisonof qos specification languages for distributed multimediaapplications. Tech. Rep. Technical Report UIUCDCS-R-2002-2302/UILU-ENG-2002-1745, Department of Com-puter Science, University of Illinois at Urbana-Champaign.November 2002.

46. Couvreur, C., Brockmeyer, E., Nollet, V., Marescaux, Th.,Catthoor, Fr., & Corporaal, H. (2005). Design-time applica-tion exploration for MP-SoC customized run-time manage-ment. In Proceedings of the international symposium onsystem-on-chip (pp. 66–73). Tampere, Finland. November 2005.

47. Couvreur, C., Nollet, V., Marescaux, T., Brockmeyer, E.,Catthoor, F., & Corporaal, H. (2006). Pareto-based appli-cation specification for MPSoC customized run-time man-agement. In Proceedings of the International Conference onEmbedded Computer Systems: Architectures, Modeling andSimulation (SAMOS) (pp. 78–84). July 2006.

48. Couvreur, C., Nollet, V., Marescaux, T., Brockmeyer, E.,Catthoor, F., & Corporaal, H. (2007). Design-time appli-cation mapping and platform exploration for MP-SoC cus-tomized run-time management. IEE Computers & DigitalTechniques, 1(2), 120–128.

49. Mamagkakis, S., Atienza, D., Poucet, C., Catthoor, F., &Soudris, D. (2006). Energy-efficient dynamic memory allo-cators at the middleware level of embedded systems. InProceedings of the Sixth ACM & IEEE international confer-ence on embedded software (EMSOFT 2006) (pp. 215 – 222).Seoul, Korea.

50. Cumming, P. (2003). The TI OMAP platform approach toSoC chapter 5, (pp. 97–118). Boston: Kluwer.

51. Saha, B., Adl-Tabatabai, A.-R., Hudson, R. L., Menon, V.,Shpeisman, T., Rajagopalan, M., et al. (2007). Runtime en-vironment for tera-scale platforms. Intel Technology Journal,11(3), 207–215. August 2007.

52. Goodacre, J., & Sloss, A. N. (2005). Parallelism and the ARMinstruction set architecture. Computer, 38(7), 42–50.

53. Carbone, J. (2005). A SMP RTOS for the ARM MPCore multi-processor. Design Strategies and Methodologies, 4(3), 64–67.

54. Williams, C. (2002). Linux scheduler latency. Technicalreport, Red Hat Inc.

55. Butenuth, R. (1994). The COSY-Kernel as an example for ef-ficient kernel call mechanisms on transputers. In Proceedingsof the 6th transputer/occam international conference.

56. Butenuth, R., Burke, W., De Rose, C., Gilles, S., &Weber, R. (1997). Experiences in building cosy—an operat-ing system for highly parallel computers. In Proceedings ofthe conference parallel computing: Fundamentals, applicationsand new directions (ParCo) (pp. 469–476).

http://www.xilinx.com/publications/xcellonline/xcell_45/xc_pdf/xc_rtos45.pdf

http://www.xilinx.com/publications/xcellonline/xcell_45/xc_pdf/xc_rtos45.pdf


57. Douglis, F., Ousterhout, J. K., Kaashoek, M. F., &Tanenbaum, A. S. (1991). A comparison of two distrib-uted systems: Amoeba and Sprite. Computing Systems, 4(4),353–384.

58. Tanenbaum, A. S. (1995). A comparison of three microker-nels. The Journal of Supercomputing, 9(1–2), 7–22.

59. Douglis, F. (1989). Experience with process migration inSprite. In Workshop on experiences with building distrib-uted and multiprocessor systems (pp. 59–72), Berkeley, CA:USENIX Association.

60. Black, D. L. (1990). Scheduling support for concurrency andparallelism in the Mach operating system. IEEE Computer,23(5), 35–43.

61. Muir, S., & Smith, J. (1998). AsyMOS—an asymmetricmultiprocessor operating system. In Open architectures andnetwork programming (pp. 25–34). April 1998.

62. Appavoo, J., Auslander, M., Butrico, M., da Silva, D. M.,Krieger, O., Mergen, M. F., et al. (2005). Experience withK42, an open-source, Linux-compatible, scalable operating-system kernel. IBM Systems Journal, 44(2), 427–440.

63. Cockx, J., Denolf, K., Vanhoof, B., & Stahl, R. (2007).SPRINT: A tool to generate concurrent transaction-levelmodels from sequential code. EURASIP Journal on AppliedSignal Processing, 2007(1), 213–213.

64. Blainey, B. (2007). Manycore impacts on commercial applica-tions. December 2007.

65. Penry, D. A. (2007). You can’t parallelize just once: Manag-ing manycore diversity. In Workshop on Manycore comput-ing, Seattle, WA, USA. June 2007.

66. Papanikolaou, A., Wang, H., Miranda, M., & Catthoor, F.(2007). Reliability issues in deep deep sub-micron technolo-gies: time-dependent variability and its impact on embeddedsystem design. In Proceedings of the 13th IEEE internationalon-line testing symposium (p. 121).

67. Azimi, M., Cherukuri, N., Jayasimha, D., Kumar, A., Kundu,P., Park, S., et al. (2007). Integration challenges and trade-offsfor tera-scale architectures. Intel Technology Journal, 11(3),173–184. August 2007.

Vincent Nollet obtained his MSc in Electrical Engineering in 1999 from the Vrije Universiteit Brussel (VUB), Belgium, and his PhD degree in 2008 from the Technische Universiteit Eindhoven (TU/e). In 2001, he started working at IMEC (Belgium) as a researcher in the multiprocessor SoC (MPSoC) domain. From 2005 until recently, he was leading the IMEC MPSoC research activity. Currently, he is managing the Furnaces Area of the IMEC R&D Semiconductor Fab.

Diederik Verkest received the Master and Ph.D. degrees in Applied Sciences from the Katholieke Universiteit Leuven (Belgium) in 1987 and 1994, respectively. He has been working in the VLSI design methodology group of IMEC on several topics related to formal methods, system design, HW/SW co-design, re-configurable systems, and MPSoC systems. He is currently in charge of the IMEC design technology for nomadic embedded systems. Diederik Verkest is Professor at the University of Brussels (VUB) and at the University of Leuven (KU-Leuven). He is a member of IEEE and a Golden Core Member of the IEEE Computer Society. Diederik Verkest has published over 100 articles in international journals and conferences. Over the past years he was a member of the program and/or organisation committees of several major international conferences. He was the General Chair of the Design, Automation and Test in Europe Conference, DATE’03.

Henk Corporaal obtained an MSc in Theoretical Physics from the University of Groningen, and a PhD in Electrical Engineering from Delft University of Technology. Corporaal has been teaching at several schools for higher education, worked at the Delft University of Technology in the field of computer architecture and code generation, had a joint appointment at the National University of Singapore, has been scientific director of the joint NUS-TUE Design Technology Institute, and has been department head and chief scientist within the DESICS division at IMEC (Belgium). Currently Corporaal is Professor in Embedded System Architectures at the Eindhoven University of Technology (TU/e) in The Netherlands. He has co-authored many papers in the (multi-)processor architecture and embedded system design area. Furthermore, he has invented a new class of VLIW architectures, the Transport Triggered Architectures. His current research projects are on the predictable design of soft and hard real-time embedded systems.
