
Journal of VLSI Signal Processing 41, 169–182, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

MPARM: Exploring the Multi-Processor SoC Design Space with SystemC

LUCA BENINI AND DAVIDE BERTOZZI
DEIS—University of Bologna, Via Risorgimento 2, Bologna, Italy

ALESSANDRO BOGLIOLO
ISTI—University of Urbino, Piazza della Repubblica 13, Urbino, Italy

FRANCESCO MENICHELLI AND MAURO OLIVIERI
DIE—La Sapienza University of Rome, Via Eudossiana 18, 00184 Roma, Italy

Received February 13, 2003; Revised December 24, 2003; Accepted July 30, 2004

Abstract. Technology is making the integration of a large number of processors on the same silicon die technically feasible. These multi-processor systems-on-chip (MP-SoC) can provide a high degree of flexibility and represent the most efficient architectural solution for supporting multimedia applications, characterized by the demand for highly parallel computation. As a consequence, tools for the simulation of these systems are needed at the design stage, with the distinctive requirements of simulation speed, accuracy and capability to support design space exploration. We developed a complete simulation platform for an MP-SoC, called MPARM, based on SystemC as the modelling and simulation environment, and including processor models, an AMBA-compliant communication architecture, memory models and support for parallel programming. A fully operational Linux version for embedded systems has been ported to this platform, and a cross-toolchain has been developed as well. Our MP simulation environment turns out to be a powerful tool for the MP-SoC design stage. As an example thereof, we use our tool to evaluate the impact of architectural parameters and bus arbitration policies on system performance, showing that the effectiveness of a particular system configuration strongly depends on the application domain and the generated traffic profile.

Keywords: system-on-chip simulation, multiprocessor embedded systems, design space exploration

1. Introduction

Systems-on-chips (SoC) are increasingly complex and expensive to design, debug and fabricate. The costs incurred in taking a new SoC to market can be amortized only with large sales volume. This is achievable only if the architecture is flexible enough to support a number of different applications in a given domain. Processor-based architectures are completely flexible and they are often chosen as the back-bone for current SoCs. Multi-media applications often contain highly parallel computation, therefore it is quite natural to envision multi-processor SoCs (MPSoCs) as the platforms of choice for multi-media. Indeed, most high-end multi-media SoCs on the market today are MPSoCs [1–3].

Supporting the design and architectural exploration of MPSoCs is key for accelerating the design process and converging towards the best-suited architectures for a target application domain. Unfortunately, we are today in a transition phase where design tuning, optimization and exploration are supported either at a very high level or at the register-transfer level. In this paper we describe an MPSoC architectural template and a simulation-based exploration tool which operates at the macro-architectural level, and we demonstrate its usage on a classical MPSoC design problem, i.e., the analysis of bus-access performance with changing architectures and access profiles.

To support research on general-purpose multiprocessors, a number of architecture-level multiprocessor simulators have been developed in the past by the computer architecture community [4–6] for performance analysis of large-scale parallel machines. These tools operate at a very high level of abstraction: their processor models are highly simplified in an effort to speed up simulation and enable the analysis of complex software workloads. Furthermore, they all postulate a symmetric multiprocessing model (i.e., all the processing units are identical), which is universally accepted in large-scale, general-purpose multiprocessors. This model is not appropriate for embedded systems, where very different processing units (e.g., general-purpose, DSP, VLIW) can coexist.

To enable MPSoC design space exploration, flexibility and accuracy in hardware modeling must be significantly enhanced. Increased flexibility is required because most MPSoCs for multimedia applications are highly heterogeneous: they contain various types of processing nodes (e.g., general-purpose embedded processors and specialized accelerators), multiple on-chip memory modules and I/O units, and a heterogeneous system interconnect fabric. These architectures are targeted towards a restricted class of applications, and they do not need to be highly homogeneous as in the case of general-purpose machines. Hardware modeling accuracy is highly desirable because it would make it possible to use the same exploration engine both during architectural exploration and during hardware design.

These needs are well recognized in the EDA (Electronic Design Automation) community, and several simulators have been developed to support SoC design [7–11]. However, these tools are primarily targeted towards single-processor architectures (e.g., a single processor core with many hardware accelerators), and their extension toward MPSoCs, albeit certainly possible, is a non-trivial task. In analogy with current SoC simulators, our design space exploration engine supports a hardware abstraction level and continuity between architectural and hardware design, but it fully supports multiprocessing. In contrast with traditional mixed-language co-simulators [7], we assume that all components of the system are modeled in the same language. This motivates our choice of SystemC as the modeling and simulation environment for our MPSoC platform.

The primary contribution of this paper is not centered on describing a simulation engine, but on introducing MPARM, a complete platform for MPSoC research, including processor models (ARM), SoC bus models (AMBA), memory models, hardware support for parallel programming, a fully operational operating system port (uClinux) and code development tools (GNU toolchain). We demonstrate how our MPSoC platform enables the exploration of different hardware architectures and the analysis of complex interaction patterns between parallel processors sharing storage and communication resources. In particular, we demonstrate the impact of various bus arbitration policies on system performance, one of the most critical elements in MPSoC design, as demonstrated in previous work [12–15].

The paper is organized as follows: Section 2 describes the concepts of the emulated platform architecture and its subsystems (network, master and slave modules), Section 3 presents the software support elements developed for the platform (compiler, peripheral drivers, synchronization, O.S.), and Section 4 gives some examples of using the tool for hardware/software exploration and for the analysis of bus arbitration policies.

2. Multiprocessor Simulation Platform

Integrating multiple Instruction Set Simulators (in the following, ISSs) in a unified system simulation framework entails several non-trivial challenges, such as the synchronization of multiple CPUs to a common time base, or the definition of an interface between the ISS and the simulation engine.

The utilization of SystemC [16] as the back-bone simulation framework represents a powerful solution for embedding ISSs in a framework for efficient and scalable simulation of multiprocessor SoCs.

Besides its distinctive capability of modeling software algorithms, hardware architectures and the interfaces of SoC or system-level designs, SystemC makes it possible to plug an ISS into the simulation framework as a system module, activated by the common system clock provided to all of the modules (not a physical clock). SystemC provides a standard and well-defined interface for the description of the interconnections between modules (ports and signals). Moreover, among the advantages of C/C++-based hardware descriptions there is the possibility of bridging the hardware/software description language gap [17].


SystemC can be used in such a way that each module consists of a C/C++ implementation of the ISS, encapsulated in a SystemC wrapper. The wrapper realizes the interface and synchronization layer between the instruction set simulator and the SystemC simulation framework: in particular, the cycle-accurate communication architecture has to be connected with the coarser-granularity domain of the ISS. The applicability of this technique is not limited to ISSs, but can be extended to encapsulate C/C++ implementations of system blocks (such as memories and peripherals) into SystemC wrappers, thus achieving considerable simulation speed-ups. This methodology trades off simulation accuracy against time, and represents an efficient alternative to a full SystemC description of the system modules (SystemC as a hardware description language) at a lower abstraction level. The latter solution would slow down the simulation, and for complex multiprocessor systems this performance penalty could turn out to be unacceptable.

A co-simulation scenario can also be supported by SystemC, where modules encapsulating C++ code (describing the simulated hardware at a high level of abstraction, i.e., behavioural) coexist with modules completely written in SystemC (generally realizing a description at a lower level of abstraction). In this way, performance versus simulation accuracy can be tuned and differentiated between the modules.

Based on these guidelines, we have developed a multiprocessor simulation framework using SystemC 1.0.2 as simulation engine. The simulated system currently contains a model of the communication architecture (compliant with the AMBA bus standard), along with multiple masters (CPUs) and slaves (memories) (Fig. 1). The intrinsic multi-master communication supported by the AMBA protocol has been exploited by declaring multiple instances of the ISS master module, thus constructing a scalable multiprocessor simulator.

Figure 1. System architecture.

Figure 2. Processing module architecture.

Processing Modules

The processing modules of the system are represented by cycle-accurate models of cached ARM cores. Each module (Fig. 2) internally consists of a simulator of the ARM CPU, the first-level cache and the peripherals (UART, timer, interrupt controller), written in C++. It was derived from the open-source, cycle-accurate SWARM (software ARM) simulator [18], encapsulated in a SystemC wrapper.

The insertion of an external (C++) ISS requires interfacing it with the SystemC environment: for example, accesses to memories and interrupt requests must be trapped and translated to SystemC signals. Another important issue is synchronization: the ISS, typically written to run as a single unit, must be capable of being synchronized with the multiprocessing environment (i.e., there must be a way to start and stop it while maintaining cycle accuracy). Finally, the ISS must be multi-instantiable (for example, it must be a C++ class), since there will be one instance of the module for each simulated processor.

The SWARM simulator is entirely written in C++. It emulates an ARM CPU and is structured as a C++ class which communicates with the external world through a Cycle function, which executes one clock cycle of the core, and a set of variables in very close relation to the corresponding pins of a real hardware ARM core. Along with the CPU, a set of peripherals is emulated (timers, interrupt controller, UART) to provide support for an operating system running on the simulator.
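As a minimal sketch of this wrapping scheme, the following SystemC module invokes a SWARM-like ISS object once per clock edge and translates its pin-like member variables into SystemC signals. The class and port names (CArmProc, n_mreq, bus_req, etc.) are illustrative placeholders under these assumptions, not the actual SWARM or MPARM interface.

```cpp
#include <systemc.h>

// Stand-in for the C++ ISS (e.g., a SWARM-like core); names are illustrative.
class CArmProc {
public:
    bool        n_mreq;          // memory request "pin"
    sc_uint<32> addr;            // address "pin"
    sc_uint<32> rdata;           // read data fed back into the core
    void Cycle() { /* execute one clock cycle of the core */ }
};

SC_MODULE(ArmWrapper) {
    sc_in<bool>          clk;       // common system clock
    sc_out<bool>         bus_req;   // request toward the bus master interface
    sc_out<sc_uint<32> > bus_addr;
    sc_in<sc_uint<32> >  bus_rdata;

    CArmProc iss;                   // one ISS instance per simulated processor

    void step() {
        iss.rdata = bus_rdata.read();   // feed bus data back into the ISS
        iss.Cycle();                    // advance the core by one clock cycle
        bus_req.write(iss.n_mreq);      // translate ISS "pins" into SystemC signals
        bus_addr.write(iss.addr);
    }

    SC_CTOR(ArmWrapper) {
        SC_METHOD(step);
        sensitive << clk.pos();         // control returns to SystemC every cycle
    }
};
```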

The cycle-level accuracy of the SWARM simulator simplifies the synchronization with the SystemC environment (i.e., the wrapper module), especially in a multiprocessor scenario, since control is returned to the main system simulation synchronizer (SystemC) at every clock cycle [18].

An interesting property of ISS wrapping is that, with relatively little effort, other processor simulators (e.g., MIPS) can be embedded in our multiprocessor simulation back-bone. Provided they are written in C/C++, their access requests to the system bus need to be trapped, so that the communication can be made explicit and the cycle-accurate bus signals can be generated in compliance with the communication architecture protocol. Moreover, the need for synchronization between simulation time and ISS simulated time arises only when the ISS to be embedded has a coarse time resolution, i.e., when it does not simulate each individual processor clock cycle.

Finally, the wrapping methodology introduces negligible communication overhead between the ISS and the SystemC simulation engine, because the ISS does not run as a separate thread; hence no inter-process communication primitives are required, which would otherwise become the bottleneck for simulation speed.

AMBA Bus Model

AMBA is a widely used standard defining the communication architecture for high-performance embedded systems [19]. Multi-master communication is supported by this back-bone bus, and requests for simultaneous accesses to the shared medium are serialized by means of an arbitration algorithm.

The AMBA specification includes an advanced high-performance system bus (AHB), and a peripheral bus (APB) optimized for minimal power consumption and reduced interface complexity to support the connection of low-performance peripherals. We have developed a SystemC description only for the former, given the multi-processor scenario we are targeting. Our implementation supports the distinctive standard-defined features of AHB, namely burst transfers, split transactions and single-cycle bus master handover. The model has been developed with scalability in mind, so that multiple masters and slaves can easily be plugged in through proper bus interfaces.

Bus transactions are triggered by asserting a bus request signal. The master then waits until bus ownership is granted by the arbiter: at that time, address and control lines are driven, while data bus ownership is delayed by one clock cycle, as an effect of the pipelined operation of the AMBA bus. Finally, data sampling at the master side (for read transfers) or slave side (for write transfers) takes place when a ready signal is asserted by the slave, indicating that on the next rising edge of the clock the configuration of the data bus can be considered stable and the transaction can be completed. Besides single transfers, four-, eight- and sixteen-beat bursts are also defined in the AHB protocol. Unspecified-length bursts are supported as well.
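The master side of this sequence can be sketched as a SystemC clocked thread, as shown below. The port names and the fixed address are illustrative assumptions and do not correspond to the literal MPARM signal names.

```cpp
#include <systemc.h>

SC_MODULE(AhbReadMaster) {
    sc_in<bool>          clk;
    sc_out<bool>         hbusreq;   // bus request toward the arbiter
    sc_in<bool>          hgrant;    // bus grant from the arbiter
    sc_out<sc_uint<32> > haddr;     // address/control phase
    sc_in<bool>          hready;    // slave signals completion of the data phase
    sc_in<sc_uint<32> >  hrdata;    // read data bus

    sc_uint<32> result;

    void read_once() {
        hbusreq.write(true);                    // 1. request the bus
        do { wait(); } while (!hgrant.read());  // 2. wait for bus ownership
        haddr.write(0x00100000);                // 3. drive address/control lines
        hbusreq.write(false);                   //    (address value is illustrative)
        wait();                                 // 4. data phase follows one cycle later
        while (!hready.read()) wait();          // 5. slave may insert wait states
        result = hrdata.read();                 // 6. sample data when ready is asserted
    }

    SC_CTOR(AhbReadMaster) {
        SC_CTHREAD(read_once, clk.pos());
    }
};
```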

An important characteristic of the AMBA bus is that the arbitration algorithm is not specified by the standard; it therefore represents a degree of freedom for a task-dependent performance optimization of the communication architecture. A great number of arbitration policies can be implemented in our multiprocessor simulation framework by exploiting some relevant features of the AMBA bus. For example, the standard allows higher-priority masters to gain ownership of the bus even though the master currently using it has not completed its transfer yet. This is the case of the early burst termination mechanism, which comes into play whenever the arbiter does not allow a master to complete an ongoing burst. In this case, masters must be able to appropriately rebuild the burst when they next regain access to the bus.

We exploited this bus preemption mechanism to implement an arbitration strategy called "slot reservation". In practice, the bus is periodically granted to the same master, which is therefore provided with a minimum guaranteed bandwidth even though it cannot compete for bus access during the remaining period of time. This strategy prevents tasks with particular bandwidth requirements from failing as an effect of arbitration delays due to a high level of bus contention.

In our SystemC model of the AMBA bus, we exploit early burst termination to suspend, when needed, ongoing bursts being carried out by the master owning the bus, and to enable it to resume operation when it is next granted the bus. This arbitration mechanism is parameterized, as the slot duration and period can be set independently to search for the most efficient solution. A traditional round-robin policy has also been implemented, allowing a comparison between the two strategies in terms of their impact on a number of metrics such as execution time, average waiting time of the processors before their bus access requests are satisfied, degree of bus idleness, number of burst early terminations, average preemption time, etc.
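One possible shape of such a grant decision is sketched below: the reserved master owns the bus during a recurring slot, and the other masters are arbitrated round-robin outside it. The names, the cycle-based parameters and the policy details (e.g., leaving the bus idle when the reserved master has no pending request, which is consistent with the idle cycles discussed in Section 4) are our illustrative choices, not the exact MPARM arbiter implementation.

```cpp
#include <vector>

struct SlotArbiter {
    unsigned slot_duration = 100;   // length of the reserved slot, in bus cycles (arbitrary default)
    unsigned slot_period   = 1000;  // distance between slot starts, in bus cycles (arbitrary default)
    unsigned reserved_id   = 0;     // master the slot is reserved to
    unsigned last_grant    = 0;     // state for round-robin outside the slot

    // Decide which requesting master is granted at bus cycle 'now'.
    // request[i] is true if master i is asserting its bus request.
    // Returns the granted master index, or -1 if the bus stays idle.
    int grant(unsigned long now, const std::vector<bool>& request) {
        bool in_slot = (now % slot_period) < slot_duration;
        if (in_slot)   // reserved slot: only the reserved master may own the bus
            return request[reserved_id] ? (int)reserved_id : -1;

        unsigned n = (unsigned)request.size();   // round-robin among the others
        for (unsigned k = 1; k <= n; ++k) {
            unsigned cand = (last_grant + k) % n;
            if (cand != reserved_id && request[cand]) {
                last_grant = cand;
                return (int)cand;
            }
        }
        return -1;     // no request pending: bus idle
    }
};
```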

Our multiprocessor simulation platform thus allows design space exploration of arbitration policies, and makes it easy to derive the most critical parameters determining the performance of the communication architecture of an MP-SoC. This capability of the simulation environment is becoming critically important, as the design paradigm for SoCs is shifting from device-centric to interconnect-centric [20]. The efficiency of a certain arbitration strategy can easily be assessed for multiple hardware configurations, such as the number of masters and different master characteristics (e.g., cache size, general-purpose versus application-specific, etc.).

Memory Sub-System

The system is provided with two levels of memory hierarchy, namely cache memory and main memory.

The cache memory is contained in the processing module and is directly connected to the CPU core through its local bus. Each processing module has its own cache, acting as a local instruction and data memory; it can be configured as a unified instruction and data cache or as two separate banks of instruction and data caches. Configuration parameters also include cache size, line length and the definition of non-cacheable areas in the address space.
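As an illustration, these parameters could be grouped as follows; the field names are ours and do not reflect the literal SWARM/MPARM configuration interface.

```cpp
#include <cstdint>
#include <vector>

struct CacheConfig {
    bool     unified;                  // true: single I+D cache, false: split banks
    unsigned size_bytes;               // total cache size (e.g., 512, 1024, ...)
    unsigned line_bytes;               // cache line length
    struct Range { uint32_t base, limit; };
    std::vector<Range> non_cacheable;  // address ranges that bypass the cache
};
```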

Main memory banks reside on the shared bus as slave devices. They consist of multiple instantiations of a basic SystemC memory module. Each memory module is mapped onto its reserved area within the address space; it communicates with the masters through the bus using a request-ready asynchronous protocol, and its access latency, expressed in clock cycles, is configurable.
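A minimal sketch of such a slave, assuming a simple request/ready handshake and a word-addressed storage array, is given below; the port names and sizes are simplifications, not the MPARM memory model itself.

```cpp
#include <systemc.h>
#include <vector>

SC_MODULE(SimpleMemory) {
    sc_in<bool>           clk;
    sc_in<bool>           req;       // access request from the bus
    sc_in<bool>           we;        // write enable
    sc_in<sc_uint<32> >   addr;
    sc_in<sc_uint<32> >   wdata;
    sc_out<sc_uint<32> >  rdata;
    sc_out<bool>          ready;     // asserted when the access completes

    unsigned latency;                // configurable latency, in clock cycles
    std::vector<sc_uint<32> > mem;   // storage, word addressed

    void serve() {
        ready.write(false);
        while (true) {
            do { wait(); } while (!req.read());     // wait for a request
            for (unsigned i = 0; i < latency; ++i)  // model the access latency
                wait();
            unsigned idx = addr.read().to_uint() >> 2;
            if (we.read()) mem[idx] = wdata.read();
            else           rdata.write(mem[idx]);
            ready.write(true);                      // complete the transfer
            wait();
            ready.write(false);
        }
    }

    SC_CTOR(SimpleMemory) : latency(1), mem(0x10000) {
        SC_CTHREAD(serve, clk.pos());
    }
};
```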

Multiprocessor Synchronization Module

In a multiprocessing system, hardware support for process synchronization is needed in order to avoid race conditions when two or more processes try to access the same shared resource simultaneously. Support for mutual exclusion is generally provided by ad hoc non-interruptible CPU instructions, such as the test&set instruction.

In a multiprocessor environment, non-interruptible instructions must be combined with external hardware support in order to obtain mutual exclusion on shared resources between different processors.

We have equipped the simulator with a bank of memory-mapped registers which work as hardware semaphores. They are shared among the processors and their behavior is similar to that of a shared memory, with the difference that when one of these 32-bit registers is read, its value is returned to the requester, but at the same time the register is automatically set to a predefined value before the completion of the read access. In this way a single read of one of the registers works as an atomic test&set operation.

This module is connected to the bus as a slave and its locations are memory-mapped in a reserved address space.
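The read-as-test&set behaviour can be sketched as follows; the register count and the value convention (0 meaning "taken", non-zero meaning "free") are assumptions of ours, not the documented MPARM encoding.

```cpp
#include <cstdint>

struct SemaphoreBank {
    static const uint32_t TAKEN = 0;   // predefined value left in the register after a read
    uint32_t regs[32];                 // bank of memory-mapped 32-bit registers

    // Bus read: returns the current value and atomically marks the semaphore
    // as taken, i.e., an atomic test&set as seen from the processors.
    uint32_t read(unsigned index) {
        uint32_t old_value = regs[index];
        regs[index] = TAKEN;           // set before the read access completes
        return old_value;              // requester sees the previous value
    }

    // Bus write: used to release (free) a semaphore with a non-zero value.
    void write(unsigned index, uint32_t value) { regs[index] = value; }
};
```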

3. Software Support

Cross-Compilation Toolchain

The cross-compilation toolchain includes the GNU gcc-3.0.4 compiler for the ARM family of processors and its related utilities, compiled under Linux. The result of the compilation and linking step is a binary image of the memory, which can be uploaded into the simulator.

Operating System Support: uClinux

Hardware support for booting an operating system has been provided to the simulator through the emulation of two basic peripherals needed by a multitasking O.S.: a timer and an interrupt controller. An additional UART I/O device allows startup, error and debug information to be displayed on a virtual console. Linux-style drivers have been written for these devices, running under the Linux 2.4 kernel. The kernel version ported onto the emulation platform is uClinux, a reduced version of Linux for embedded systems without memory management unit support [21].

Our simulation platform allows multiple parallel uClinux kernels to be booted on independent processors and benchmarks or interactive programs to be run, using the UART as an I/O console.

Support for Multiple Processors

The software support for multiprocessors includes the initialization step and synchronization primitives, together with some modifications of the memory map.

When a processor accesses the memory region where it expects to find the exception vectors, the address is shifted to a different region in the main memory, so that each processor can have its own distinct exception table. The result is a virtual memory map specific to each processor (Fig. 3), which must not be confused with general-purpose memory management support.

Figure 3. Memory map.

Having its own reset vector, each processor can execute its own startup code independently of the others. Each processor initializes its registers (e.g., the stack pointer) and private resources (timers, interrupt controllers). Shared resources are initialized by a single processor while the others wait using a semaphore synchronization method. At the end of the initialization step, each processor branches to its own main routine (namely main0, main1, main2, etc.).
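A hypothetical shape of this startup flow is sketched below. How a core obtains its own index (CPU_ID), the boot semaphore index, its reset-to-taken state, and the sem_lock()/sem_free() primitives (of the kind described under the synchronization facilities below) are all assumptions, not the literal MPARM startup code.

```cpp
extern const unsigned CPU_ID;      // hypothetical: provided by per-processor startup code
extern void sem_lock(unsigned id); // blocking test&set (see synchronization facilities below)
extern void sem_free(unsigned id); // release
extern void main0(void);           // per-processor entry points
extern void main1(void);

#define BOOT_SEM 0                 // semaphore guarding shared-resource initialization
                                   // (assumed to reset in the "taken" state)
void startup(void) {
    // ...initialize stack pointer, timers, interrupt controller (private resources)...

    if (CPU_ID == 0) {
        // shared resources are initialized by a single processor
        sem_free(BOOT_SEM);        // signal that shared initialization is done
    } else {
        sem_lock(BOOT_SEM);        // wait until processor 0 has finished
        sem_free(BOOT_SEM);        // let the next waiting processor proceed
    }

    if (CPU_ID == 0) main0(); else main1();   // branch to the processor's own main routine
}
```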

The linker script is responsible for the allocation of the startup routines and of the code and data sections of the C program.

The synchronization software facilities include definitions and primitives to support the hardware semaphore region (multiprocessor synchronization module) at the C programming level. The routines consist of a blocking test&set function, a non-blocking test function and a free function.
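A minimal sketch of such primitives over the memory-mapped semaphore bank could look as follows; the base address and the value convention (0 meaning "taken", non-zero meaning "free") are assumptions, not the actual MPARM definitions.

```cpp
#include <stdint.h>

#define SEM_BASE ((volatile uint32_t *)0xA0000000)  /* hypothetical base address */

/* Non-blocking test: returns non-zero if the semaphore was free and has just
 * been acquired (the hardware marks it taken on the same read). */
static inline int sem_test(unsigned id) {
    return SEM_BASE[id] != 0;
}

/* Blocking test&set: spin until the semaphore is acquired. */
static inline void sem_lock(unsigned id) {
    while (SEM_BASE[id] == 0)
        ;   /* busy wait: each read is an atomic test&set in hardware */
}

/* Free: release the semaphore by writing a non-zero value back. */
static inline void sem_free(unsigned id) {
    SEM_BASE[id] = 1;
}
```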

4. Experimental Results

Our simulation environment can be used for different kinds of design exploration, and this section gives some examples thereof. To this purpose, we used the aforementioned software toolchain to write some benchmark programs for a two-processor system, with different levels of data interaction between the two processors.

Figure 4 shows the common system architecture configuration used for the examples. Two ARM processing modules are connected to the AMBA bus and act as masters, and two identical memory modules are connected as slaves and can be accessed by both processors. The third slave module is the semaphore unit, used for synchronization in one of the following benchmark programs.

Benchmark Description

(a) Same data set program (shared data source). The two processors execute the same algorithm (matrix multiplication) on the same data source. In this program, half of the result matrix is generated by the first processor while the other half is generated by the second processor (Fig. 5). The two processors share the source data (the two matrices to be multiplied), but there are no data dependencies between them, so there is no need to use synchronization functions between the processors.

(b) Data dependent program (producer-consumer algorithm). The first processor executes a one-dimensional N-size integer FFT on a data source stream, while the second executes a one-dimensional N-size integer IFFT on the data produced by the first processor. For each N-size FFT block completed, a dedicated semaphore is released by the first CPU before it starts elaborating the subsequent block. The second CPU, before performing the IFFT on a data block, checks the related semaphore and is blocked until data-ready is signaled.
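In terms of the semaphore primitives sketched above, the handshake of benchmark (b) could look roughly as follows; the block size and count, the per-block semaphore numbering and the fft()/ifft() routines are illustrative placeholders, and the block semaphores are assumed to reset in the "taken" state.

```cpp
#define N         256                /* illustrative FFT block size        */
#define N_BLOCKS  64                 /* illustrative number of blocks      */
#define BLOCK_SEM(b) (1 + (b))       /* one dedicated semaphore per block  */

extern void sem_lock(unsigned id);   /* blocking test&set (see above)      */
extern void sem_free(unsigned id);   /* release                            */
extern void fft(int *block);         /* hypothetical N-point integer FFT   */
extern void ifft(int *block);        /* hypothetical N-point integer IFFT  */
extern int  stream[N_BLOCKS * N];    /* shared data stream in main memory  */

void producer(void) {                /* runs on the first processor        */
    for (int b = 0; b < N_BLOCKS; ++b) {
        fft(&stream[b * N]);
        sem_free(BLOCK_SEM(b));      /* release: block b is ready          */
    }
}

void consumer(void) {                /* runs on the second processor       */
    for (int b = 0; b < N_BLOCKS; ++b) {
        sem_lock(BLOCK_SEM(b));      /* blocked until data-ready signaled  */
        ifft(&stream[b * N]);
    }
}
```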


Figure 4. System architecture for benchmark examples.

Figure 5. Matrix multiplication.

Architectural Exploration

In this example we show the results obtained running the preceding benchmarks and varying architectural or program parameters.

Two parameters are explored: one related to the system architecture (the cache size) and one related to the program being executed (the FFT size, which affects data locality). The FFT performed on an N-size block is hereafter indicated as "FFT N".

Using our multiprocessor simulation tool we can obtain output statistics in text format such as those shown in Fig. 6. In Figs. 7–9 we graphically illustrate the results relative to contention-free bus accesses (the percentage of times a CPU is immediately granted the bus upon an access request, with respect to the total number of bus access requests), the average waiting time before gaining bus ownership (this delay is a side-effect of the arbitration mechanism and of the serialization of bus requests), and the average cache miss rate for the two processors.

Figure 6. Statistics collection example (FFT 16 with 512 byte cache size).

Figure 7. Contention-free bus accesses vs. cache size.

Figure 8. Average waiting time for the bus (in cycles) vs. cache size.

Figure 9. Cache miss rate vs. cache size.

Arbitration Policies Exploration

We performed extensive experiments to evaluate the effects of bus arbitration policies on the performance provided by the system on our benchmark applications. In particular, we simulated each benchmark with different bus arbiters, implementing round-robin and slot reservation policies. Slot reservation was parameterized in terms of slot duration and slot period, and the effects of both parameters were analyzed by means of simulation sweeps.

Figure 10. Average waiting time vs. slot duration (slot period = 10 ms).

Figure 11. Average waiting time vs. slot period (slot duration = 1 ms).

Results obtained for benchmark (a) are reported in Figs. 10–14. Figures 10 and 11 show the effects of slot duration and period on the average waiting time per memory access perceived by each processor. Horizontal lines on the two graphs represent the average waiting time obtained with round robin, which is almost the same for the two processors. In our experiments, slots are reserved to processor 1. As expected, increasing the duration of the slot reduces the waiting time of processor 1 and increases that of processor 2.

Figure 12. Bus cycles vs. slot duration (slot period = 10 ms).

Figure 13. Bus cycles vs. slot period (slot duration = 1 ms).

As regards the slot period, its increase corresponds to an increase of the inter-slot time, since the slot duration is kept constant here, and this translates into an opposite effect on waiting times. In fact, in a system with only two masters the inter-slot time can be viewed as a slot time reserved to the second processor, so that the larger the inter-slot time, the better the bus performance perceived by processor 2. It is worth noting that slot reservation never outperforms round robin: in our application the two processors have exactly the same workload and the same memory access needs, so that it is counterproductive to implement unbalanced arbitration policies.

Figure 12 shows the overall number of idle bus cycles and the execution time of the two processors for different slot durations. Interestingly, the idleness of the bus has a non-monotonic behavior, clearly showing a global optimum. This can be explained by observing that matrix multiplication is the only task in the system, so that each processor stops executing as soon as it finishes its sub-task. If the two processors have different execution times because of slot reservation, once the faster processor completes execution it cannot generate further requests, while the bus arbiter keeps reserving time slots to it, thus necessarily causing idle bus cycles. The number of bus idle cycles reported in Fig. 12 suffers from this situation, since it is computed over the execution time of the slower processor. A similar behavior is shown in Fig. 13 as a function of the slot period.

Figure 14. Preemption (slot duration = 1 ms).

Finally, Fig. 14 shows the preemptive effect of slot reservation. The number of preemptions represents the number of times a data burst is interrupted because it exceeds the time slot. For a fixed slot duration of 1 ms, the number of preemptions decreases as the slot period, and hence the inter-slot time, increases. Notice that the inter-slot time affects only the number of preemptions of processor 2, while the number of preemptions of processor 1 depends only on the slot duration, which does not change in this experiment. The average burst preemption time (also reported in Fig. 14) is a measure of the time the interrupted processor has to wait before resuming the data transfer. For bursts initiated by processor 2, the preemption time is fixed and equal to the slot duration. For bursts initiated by processor 1, on the other hand, the preemption time grows with the inter-slot time. This is why the average burst preemption time is a sub-linear function of the inter-slot time.

Simulation Performance

In this section we report the simulation speed and performance of our simulator. These data are very important since they are an index of the complexity vs. accuracy trade-off and contribute to defining the design space that can actually be explored using MPARM.

In order to obtain more meaningful measurements, a further benchmark was developed. It consists of a chain of pipelined data manipulation operations (in this case, matrix multiplications), where each processor is a stage of the pipeline (i.e., processor N operates on data produced by processor N − 1 and its data outputs are the inputs of processor N + 1) and the processing load is equally distributed over the CPUs.

This system structure has the advantage of being easily expandable (in the number of processing stages) with minor modifications to the simulator and to the benchmark code. In this way we can produce data on simulation time and show how it scales with the number of processors.

Figure 15 shows the simulation time needed for the execution of about 1.5 million "simulated" cycles on each processing module. We report the measures as a function of the number of processors and of the output statistics produced (global statistics collection, as in Fig. 6, or complete signal waveform tracing (VCD files) and memory access tracing). The whole simulator was running on a Pentium 4, 2.26 GHz workstation.

Figure 15. Simulation time (seconds).

Overall simulation speed is in the range of 60000–80000 cycles/sec without signal tracing (i.e., 2 simulated processors proceed at about 30000–40000 cycles/sec each, and 6 processors at roughly 10000 cycles/sec each). The collection of global statistics does not significantly affect simulation speed, while run-time signal tracing has a deeper impact.

5. Conclusions

We have developed a complete platform for the simulation of an MP-SoC, allowing investigation of the parameter space (related to the architecture configuration or to the protocols) to come up with the most efficient solution for a particular application domain. Our platform uses SystemC as simulation engine, so that hardware and software can be described in the same language, and is based on an AMBA-compliant communication architecture. ARM processors act as bus masters (as in commercial high-end multimedia SoCs), and the simulation platform includes memory modules, synchronization facilities, and support for system software (a port of the uClinux OS and the development of a cross-toolchain).

We have shown examples of applications wherein our simulation environment is used to explore some design parameters, namely cache parameters and bus arbitration policies. The applications involve data-independent or data-dependent tasks running on different ARM CPUs sharing the main memory through a common AMBA bus. The examples show how to derive important metrics (e.g., the average waiting time for accessing the bus after the request is asserted) and parameters (e.g., the cache size) that heavily impact system performance, proving the platform's effectiveness in supporting the design stage of a multi-processor system-on-chip.

References

1. Philips, "Nexperia Media Processor," http://www.semiconductors.philips.com/platforms/nexperia/media processing/

2. Mapletree Networks, "Access Processor," http://www.mapletree.com/products/vop tech.cfm

3. Intel, “IXS1000 media signal processor,” http://www.intel.com/design/network/products/wan/vop/ixs1000.htm

4. P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, vol. 35, Issue 2, Feb. 2002.

5. M. Rosenblum, S.A. Herrod, E. Witchel, and A. Gupta, "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel & Distributed Technology: Systems & Applications, vol. 3, Issue 4, Winter 1995.

6. C.J. Hughes, V.S. Pai, P. Ranganathan, and S.V. Adve, "Rsim: Simulating Shared-Memory Multiprocessors with ILP Processors," IEEE Computer, vol. 35, Issue 2, Feb. 2002.

7. Mentor Graphics, "Seamless Hardware/Software Co-Verification," http://www.mentor.com/seamless/products.html

8. CoWare, Inc., “N2C,” http://www.coware.com/cowareN2C.html

9. K. Van Rompaey, D. Verkest, I. Bolsens, and H. De Man, "CoWare—A Design Environment for Heterogeneous Hardware/Software Systems," in Proceedings of EURO-DAC '96 with EURO-VHDL '96, Sep. 1996, pp. 16–20.

10. S.S. Mukherjee, S.K. Reinhardt, B. Falsafi, M. Litzkow, S. Huss-Lederman, M.D. Hill, J.R. Larus, and D.A. Wood, "Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator," in Workshop on Performance Analysis and Its Impact on Design (PAID), June 1997.

11. B. Falsafi and D.A. Wood, "Modeling Cost/Performance of a Parallel Computer Simulator," ACM Transactions on Modeling and Computer Simulation (TOMACS), Jan. 1997.

12. K. Lahiri, A. Raghunathan, G. Lakshminarayana, and S. Dey, "Communication Architecture Tuners: A Methodology for the Design of High-Performance Communication Architectures for System-on-Chips," in Proceedings of the 37th Design Automation Conference, 2000, pp. 513–518.

13. K. Lahiri, A. Raghunathan, and G. Lakshminarayana, "LOTTERYBUS: A New High-Performance Communication Architecture for System-on-Chip Designs," in Proceedings of the Design Automation Conference, 2001.

14. K. Anjo, A. Okamura, T. Kajiwara, N. Mizushima, M. Omori, and Y. Kuroda, "NECoBus: A High-End SOC Bus with a Portable & Low-Latency Wrapper-Based Interface Mechanism," in Proceedings of the IEEE Custom Integrated Circuits Conference, 2002, pp. 315–318.

15. K.K. Ryu, E. Shin, and V.J. Mooney, "A Comparison of Five Different Multiprocessor SoC Bus Architectures," in Proceedings of the Euromicro Symposium on Digital Systems Design, 2001, pp. 202–209.

16. Synopsys, Inc., “SystemC, Version 2.0,” http://www.systemc.org.

17. G. De Micheli, "Hardware Synthesis from C/C++ Models," in DATE '99: Design Automation and Test in Europe, Mar. 1999, pp. 382–383.

18. M. Dales, "SWARM—Software ARM," http://www.dcs.gla.ac.uk/~michael/phd/swarm.html

19. ARM “AMBA Bus,” http://www.arm.com/armtech.nsf/html/AMBA?OpenDocument&style=AMBA

20. J. Cong, "An Interconnect-Centric Design Flow for Nanometer Technologies," in Int. Symp. on VLSI Technology, Systems, and Applications, June 1999, pp. 54–57.

21. uClinux – www.uclinux.org.

Luca Benini received the B.S. degree (summa cum laude) in electrical engineering from the University of Bologna, Italy, in 1991, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1994 and 1997, respectively. He is an associate professor in the Department of Electronics and Computer Science at the University of Bologna. He also holds visiting researcher positions at Stanford University and the Hewlett-Packard Laboratories, Palo Alto, CA.

Dr. Benini's research interests are in all aspects of computer-aided design of digital circuits, with special emphasis on low-power applications, and in the design of portable systems. He is co-author of the book Dynamic Power Management: Design Techniques and CAD Tools, Kluwer, 1998.

Dr. Benini is a member of the technical program committee of several technical conferences, including the Design Automation Conference, the International Symposium on Low Power Design and the International Symposium on Hardware/Software Codesign.

Davide Bertozzi received the B.S. degree in electrical engineering from the University of Bologna, Bologna, Italy, in 1999.

He is currently pursuing the Ph.D. degree at the same University and is expected to graduate in 2003. His research interests concern the development of SoC co-simulation platforms, the exploration of SoC communication architectures and low power system design.

Alessandro Bogliolo received the Laurea degree in electrical engineering and the Ph.D. degree in electrical engineering and computer science from the University of Bologna, Bologna, Italy, in 1992 and 1998.

In 1995 and 1996 he was a Visiting Scholar at the Computer Systems Laboratory (CSL), Stanford University, Stanford, CA.

From 1999 to 2002 he was an Assistant Professor at the Department of Engineering (DI) of the University of Ferrara, Ferrara, Italy. Since 2002 he has been with the Information Science and Technology Institute (STI) of the University of Urbino, Urbino, Italy, as Associate Professor. His research interests are mainly in the area of digital integrated circuits and systems, with emphasis on low power and signal integrity.

Francesco Menichelli was born in Rome in 1976. He received the Electronic Engineering degree in 2001 from the University of Rome "La Sapienza". Since 2002 he has been a Ph.D. student in Electronic Engineering at "La Sapienza" University of Rome.


His scientific interests focus on low power digital design, and in particular on system-level techniques for low power consumption, power modeling and simulation of digital systems.

Mauro Olivieri received a Master degree in electronic engineering "cum laude" in 1991 and a Ph.D. degree in electronic and computer engineering in 1994 from the University of Genoa, Italy, where he also worked as an assistant professor. In 1998 he joined the University of Rome "La Sapienza", where he is currently an associate professor in electronics. His research interests are digital systems-on-chip and microprocessor core design. Prof. Olivieri supervises several research projects supported by private and public funding in the field of VLSI system design.