
Fast Architecture Evaluation of Heterogeneous MPSoCs by Host-Compiled Simulation

黃 翔
Dept. of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

2012.09.10

Abstract

For evaluating important architectural decisions, such as the tile structure and the core selection within each tile for future 100–1000 core designs, fast and flexible simulation approaches are mandatory. We evaluate heterogeneous tiled MPSoCs using a timing-approximate simulation approach.

To verify the performance goals of the heterogeneous MPSoC beyond functional correctness, we propose a timing-approximate simulation approach and a time-warping mechanism. It allows the investigation of phases of thread (re-)distribution and resource awareness with appropriate accuracy.

For selected case studies, it is shown how architectural parameters may be varied very quickly, enabling the exploration of different designs with respect to cost, performance, and other design objectives.

Introduction

Processor architectures are becoming not only more and more parallel but also increasingly heterogeneous for reasons of energy efficiency. This trend toward many-core system designs, implementing hundreds to thousands of cores as well as hardware accelerators on a single chip, leads to many different challenges, such as overheating, reliability and security issues, as well as resource contention.

As a remedy, resource-aware programming concepts such as invasive computing have recently been proposed; they exploit self-adaptiveness and self-organization of thread generation and distribution.

This heterogeneity poses a big problem for evaluating architectural design choices early, such as the number of tiles, the internal tile structure, and the selection of cores within each tile.


Background on Resource-Aware Programming

The three distinguished programming constructs of invasive computing can be summarized as follows:

Invasion: The first step of an invasive program is to explore its neighborhood and claim resources in an invasion phase. This is done by issuing a library call to invade. The run-time system then builds and returns a claim object, which denotes the set of compute resources now allocated to the calling application.

Infection: The second step is to copy the code and the data to the invaded compute resources and to execute this code in parallel in an execution phase called infection. This is done by issuing a library call to infect on the claim object.

Retreat: After the parallel execution terminates, the programmer finally has to free the occupied resources by calling the library function retreat on the claim object.
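As a rough illustration of this three-phase lifecycle, the following Python mock sketches the invade/infect/retreat pattern. The class and function names here are hypothetical and simplified for illustration; they do not reflect the actual invasive-computing library API.

```python
# Hypothetical mock of the invade/infect/retreat lifecycle (NOT the real library API).
from concurrent.futures import ThreadPoolExecutor

free_pool = [f"RISC{i}" for i in range(1, 5)]        # toy run-time system: one tile, 4 RISC cores

class Claim:
    """A set of compute resources allocated to the calling application."""
    def __init__(self, resources):
        self.resources = list(resources)

    def infect(self, ilet):
        # Infection: execute the i-let on every claimed resource in parallel.
        with ThreadPoolExecutor(max_workers=max(1, len(self.resources))) as pool:
            list(pool.map(ilet, self.resources))

    def retreat(self):
        # Retreat: free the occupied resources again.
        free_pool.extend(self.resources)
        self.resources.clear()

def invade(quantity):
    # Invasion: explore the neighborhood and claim `quantity` free resources.
    claimed = [free_pool.pop(0) for _ in range(min(quantity, len(free_pool)))]
    return Claim(claimed)

claim = invade(2)                                             # invasion phase
claim.infect(lambda core: print("i-let running on", core))    # infection phase (parallel)
claim.retreat()                                               # retreat phase
```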


Architecture Evaluation (1/3)

We want to simulate the performance of multiple resource-aware applications running on heterogeneous tiled architectures.

In order to simulate the functional behavior of the applications as well as important timing information, we employ a host-compiled simulation approach. This approach is based on a time measurement on the host processor and a time-warping mechanism for scaling the measured execution time to a modeled target processor.

We decided to use this approach because it allows us to easily create heterogeneous tiled architectures, to change the parameters of the contained processing, memory, and communication resources, and to evaluate the performance of the architecture and the functional correctness of the applications in a very short time.


Architecture Evaluation (2/3)

In Figure 1, an overview of our proposed MPSoC architecture evaluation is depicted.

Figure 1: Overall flow of architecture evaluation.


Architecture Evaluation (3/3)

The total hardware cost is simply computed as the sum over all underlying computational resources, i.e., only the cost of cores, of the PEs of a tightly coupled processor array (TCPA), and of network routers is considered.

The application model allows us to define execution scenarios that include multiple competing applications with different degrees of parallelism.


MPSoC Architecture Model (1/2)

Figure 2 displays a typical example of such an MPSoC architecture as considered throughout this paper.

Figure 2: A generic invasive tiled architecture with different processor types, accelerator tiles such as TCPAs, and memory tiles.


MPSoC Architecture Model (2/2)

We characterize a tiled architecture according to Figure 2 by the following structural parameters: a set T = {T1, . . . , Tm} of tiles, with a total of m tiles; to each tile Ti ∈ T there is associated a size s(Ti) (the number of cores per tile) and a processor type r(Ti) ∈ {RISC, i-Core, TCPA} (assuming each tile is composed only of cores of the same type).

For example, say the upper left tile in Figure 2 is T1. Then, s(T1) = 2(?) and r(T1) = RISC. Similarly, the upper TCPA tile T2 is characterized by s(T2) = 20 and r(T2) = TCPA.

Evidently, the cost of an MPSoC will depend not only on tile number and tile size, but also on the type of processor chosen within each tile.
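The following is a minimal Python sketch of this structural model (T, s(Ti), r(Ti)) together with the cost sum over cores, TCPA PEs, and network routers described earlier. The per-resource area cost units are placeholders only; the actual values are those of Table 1, which is not reproduced in this transcript.

```python
# Minimal sketch of the tile-based architecture model and the hardware cost sum.
# AREA_COST values are placeholders; the real area cost units are given in Table 1.
from dataclasses import dataclass

AREA_COST = {"RISC": 1.0, "i-Core": 2.0, "TCPA_PE": 0.25, "router": 0.5}  # placeholder units

@dataclass
class Tile:
    size: int    # s(Ti): number of cores / PEs in the tile
    rtype: str   # r(Ti): one of "RISC", "i-Core", "TCPA"

def total_cost(tiles):
    """Sum the area cost of all cores/PEs plus one network router per tile."""
    cost = 0.0
    for t in tiles:
        unit = "TCPA_PE" if t.rtype == "TCPA" else t.rtype
        cost += t.size * AREA_COST[unit] + AREA_COST["router"]
    return cost

# Example: a 16x4 homogeneous layout (16 tiles with 4 RISC cores each).
arch_16x4 = [Tile(size=4, rtype="RISC") for _ in range(16)]
print(total_cost(arch_16x4))
```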


Application and Programming Model (1/3)

The applications express their temporal resource requirements only by using the invasive programming constructs. Applications may request additional compute resources for their parallel execution or release compute resources in order to make them available for other applications running on the architecture.

An example of an invasive resource-aware program as well as the execution of this program on a tile containing four RISC cores (like the upper left tile in Figure 2) is shown in Figure 3.


Application and Programming Model (2/3)

Figure 3: Example program using resource-aware programming constructs for parallel execution, requesting two additional CPUs at runtime.


Application and Programming Model (3/3)

① Here, the program initially starts on RISC1.

② Then, the program wants to allocate two additional RISC CPUs for parallel execution.
   4: c.add(new TypeConstraint(PEType.RISC));
   5: c.add(new PEQuantity(2));

③ If the invasion of two more cores is successful, a claim containing two free RISC cores is returned. This claim and the so-called homeClaim, which denotes the initially assigned core of the application, are merged into a single claim.
   6: val claim = homeClaim + Claim.invade(c);

④ Now, the parallel execution can be started by issuing an infect command with the appropriate i-let.
   10: claim.infect(ilet);

⑤ The resources RISC1, RISC2, and RISC3 are used in parallel until the initial program finally issues a retreat on RISC2 and RISC3 after all child i-lets have terminated.
   11: claim.retreat();


Host-Compiled Simulation (1/6)

Basically, this simulation approach is composed of two parts:

a) A time measurement on the host processor and a time warping mechanism for scaling the measured execution time to a modeled target processor.

b) A synchronization mechanism for simulating parallel applications.

As measurement parameter we use the number of executed instructions on the host processor.

After counting the number of executed instructions, we map this value to an execution time on the target processor by applying a set of analytical equations. We call this time mapping time warping. The equations are parameterized by the computational properties of the target processor and by some general properties of the application.


Host-Compiled Simulation (2/6)

For an estimation of the execution time t on the target processor, we apply an analytical equation with the following parameters:

• I: the number of executed instructions
• CPIM: the cycles per instruction of instructions that access the main memory
• CPIC: the cycles per instruction of instructions that can be computed without a memory access
• pM, pC: the fractions of memory instructions and compute instructions, provided by the properties of the application, where pM, pC ∈ [0, 1] ⊂ ℝ and pM + pC = 1
• f: the clock frequency of the target processor
• b: a slowdown factor, defined as the ratio of the required bandwidth B of all applications on one tile to the maximum available memory bandwidth BM of this tile: b = B / BM
• N: the number of parallel execution units (e.g., in the case of a CPU, we set N = 1)
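Since the equation itself appears only as a figure on the slide and is not reproduced in this transcript, the sketch below shows one plausible way the listed parameters could be combined into a time-warping function; it is not the authors' exact formula, and the numbers in the usage example are made up.

```python
# Illustrative time-warping sketch. The paper's exact equation is not reproduced here;
# this is only one plausible combination of the listed parameters.
def time_warp(I, p_M, p_C, CPI_M, CPI_C, f, b, N=1):
    """Map an instruction count I measured on the host to a target execution time.

    Assumption: memory instructions are additionally slowed down by the bandwidth
    slowdown factor b = B / B_M, and the work scales over N parallel execution units.
    """
    assert abs(p_M + p_C - 1.0) < 1e-9
    cycles = I * (p_C * CPI_C + p_M * CPI_M * (1.0 + b))  # target cycles (assumed model)
    return cycles / (f * N)                               # seconds on the target

# Example with made-up numbers: 1e9 instructions, 30% memory accesses,
# CPI_C = 1, CPI_M = 4, 400 MHz target clock, b = 0.5, single core.
t = time_warp(I=1e9, p_M=0.3, p_C=0.7, CPI_M=4, CPI_C=1, f=400e6, b=0.5, N=1)
print(f"estimated target execution time: {t:.3f} s")
```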


Host-Compiled Simulation (3/6)

In order to obtain correct simulation results, it has to be guaranteed that each simulation thread has reached at least that point in simulation time at which a modification of the global state occurs. Therefore, a time-based and event-driven synchronization mechanism is provided that guarantees the causal and time-aware execution sequence of the threads according to specifically defined synchronization points.

We chose each call to the resource-aware programming library as a synchronization point, because at such a point, the shared status information of the underlying architecture model is read and modified.

There are two thread types:

1. Simulation thread: maintains its local simulation time; generates events that contain the current local simulation time of the thread and puts them into the global event list.

2. Synchronization thread: maintains the global simulation time, and only this thread advances it; the events are read and analyzed by the synchronization thread.


Host-Compiled Simulation (4/6)

Both thread types follow a certain procedure for the synchronization, which is depicted in Figure 4.

Figure 4: Flowchart of the synchronization mechanism between the simulation of multiple applications and parallel threads.


Host-Compiled Simulation (5/6)

① The execution of a simulation thread begins by starting the time measurement function.

② The simulation starts immediately and continues until a synchronization point is reached.

③ The time measurement is then stopped.

④ The time-warping mechanism is applied to the measured number of instructions.

⑤ After updating the local simulation time by adding the determined time value to it, an event containing the local simulation time is created and put into the global event list. Now, a barrier synchronization takes place and blocks the thread until all existing simulation threads as well as the synchronization thread have reached that barrier.


Host-Compiled Simulation (6/6)

⑥ Now only the synchronization thread operates. It determines the event with the lowest time value in the global event list, sets the global simulation time to that value, and removes this event from the list.

Again, a barrier synchronization for all threads takes place.

⑦ After this synchronization, the simulation threads check whether the global simulation time equals their local simulation time. If so, the simulation thread continues its simulation; otherwise, it runs into the first barrier again. The synchronization thread directly runs into the first barrier again and waits for the other threads to synchronize.
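Below is a minimal Python sketch of this two-barrier scheme, assuming simulation threads that advance by precomputed time steps; the real simulator instead measures host instructions and applies time warping. The global event list, the global simulation time advanced only by the synchronization thread, and the two barriers follow steps ①–⑦ above.

```python
# Minimal sketch of the two-barrier synchronization between simulation threads and the
# synchronization thread. Segment durations are made-up stand-ins for time-warped results.
import heapq
import threading

durations = {"app0": [4, 2, 5], "app1": [3, 3], "app2": [6, 1, 1]}   # made-up segment times
total_events = sum(len(d) for d in durations.values())

events = []                      # global event list (min-heap of (time, thread name))
global_time = [0.0]              # global simulation time, advanced only by the sync thread
lock = threading.Lock()
barrier_a = threading.Barrier(len(durations) + 1)   # reached after events are posted
barrier_b = threading.Barrier(len(durations) + 1)   # reached after the global time advanced

def simulation_thread(name, steps):
    local, seg, pending = 0.0, iter(steps), False
    for _ in range(total_events):                    # attend every synchronization round
        if not pending:                              # we were released: simulate next segment
            dt = next(seg, None)
            if dt is not None:
                local += dt                          # update local simulation time
                with lock:
                    heapq.heappush(events, (local, name))   # post event to the event list
                pending = True
        barrier_a.wait()                             # first barrier: all events posted
        barrier_b.wait()                             # second barrier: global time advanced
        if pending and local == global_time[0]:      # global time caught up with us
            print(f"{name} resumes at t={local}")
            pending = False

def synchronization_thread():
    for _ in range(total_events):
        barrier_a.wait()
        with lock:
            t, _name = heapq.heappop(events)         # event with the lowest time value
        global_time[0] = t                           # only this thread advances global time
        barrier_b.wait()

threads = [threading.Thread(target=simulation_thread, args=(n, d)) for n, d in durations.items()]
threads.append(threading.Thread(target=synchronization_thread))
for t in threads: t.start()
for t in threads: t.join()
```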


Experimental Results (1/12)

The simulation runs were executed on an Intel Core i7 quad-core CPU with eight virtual cores at 2.93 GHz.

We specify the costs and the parameters of the hardware model for the different types of resources of a heterogeneous architecture according to Table 1.

Table 1: Area cost units and hardware model parameters of the different types of resources used in these experiments.


Experimental Results (2/12)

Here, we considered a homogeneous architecture consisting of 64 processing resources of type RISC. Figure 5 shows the different layout variants we evaluated in our experiments.

The resource-aware application used here only utilizes homogeneous architectures and does not contain any part that could exploit heterogeneity.

Figure 5: Selected tile layouts of a homogeneous architecture. Here, only a grid-based topology is considered. Each tile consists of one or more equal processing resources. The architecture on the left consists of only one tile that contains 64 processing resources.


Experimental Results (3/12)

We simulated only one resource-aware application. We increased the degree of parallelism of the application by spawning a given number of threads, which are executed in parallel on the architecture.

The results are shown in Figure 6.

Figure 6: Simulation of a resource-aware application on a homogeneous architecture with different tile layouts. The computation is done in parallel by spawning several threads.


Experimental Results (4/12)

In general, one can see that the application gains a larger speedup when spawning a low number of threads and runs into saturation when spawning a large number of threads.

The simulation shows that the 1×64 tile layout results in a higher latency than the other tile layouts when more threads are used, because the bandwidth limitation of a shared-memory system slows down the execution of the threads.

Among the four considered configurations, the best layout for this application is 16×4. It is a mixture of a shared-memory and a distributed-memory system; here, the bandwidth limit on one tile only affects a few threads.


Experimental Results (5/12)

Our second experiment simulates a given number of instances of the same application on the architectures. We estimated the latency for the execution of all applications. We fixed the degree of parallelism at four threads per application, and each application is started at the same time. The results for the different tile layouts are shown in Figure 7.

Figure 7: Simulation of several resource-aware applications in parallel on a homogeneous architecture with different tile layouts.


Experimental Results (6/12)

One can see that the 1×64 tile layout results in the highest latencies because of the bandwidth limit. The other layouts result in lower latencies, because more tiles are used and the bandwidth limit is distributed over these tiles. Again, the 16×4 tile layout results in the best latency.

For the evaluation of the architecture variants against cost and performance, we used the sum of the area units of all contained resources within the architecture as the cost value, and the number of applications per second the architecture is able to execute as the performance value.

The results of the evaluation are shown in Figure 8.


Experimental Results (7/12)

In the case of the 64×1 layout, the costs are the maximum for a homogeneous architecture.

The 1×64 layout has the lowest costs, but also the lowest performance.

The 16×4 layout has a high performance and moderate costs.

Figure 8: Evaluation of different tile layouts against costs and performance.


Experimental Results (8/12)

In our second series of experiments, we evaluated the costs and performance of heterogeneous architectures. Here, we studied five different configurations of tiled architectures with a 16×4 layout, as shown in Figure 9.

Figure 9: Selected heterogeneous architectures. Each tile may consist of processing resources of type RISC (4), i-Core (1), or TCPA (4×4 or 8×8).


Experimental Results (9/12)

As application scenario, we considered a compute-bound resource-aware application, which consists of three sub-algorithms with the following characteristics:
a) The first part is a task-parallel execution.
b) The second part is computationally intensive and suited for data-parallel accelerators like the TCPAs.
c) The third part may benefit from custom instructions of ASIPs, such as the i-Core.

The results of the simulation of several of these applications on the considered architectures are shown in Figure 10.

They show the average latency of the simulated applications.


Experimental Results (10/12)

Figure 10: Simulation of several resource-aware applications in parallel on a heterogeneous architecture with different types of contained processing resources.


Experimental Results (11/12)

Here, one can see that the homogeneous architecture (Arch1) does not fit the requirements of the application; thus, no speedup is observable in this case. The latency is constant over the number of applications, because the applications are mainly compute-bound and are not slowed down by bandwidth limitations.

Among the considered architectures, Arch4, which is a mixture of small TCPA tiles and i-Core tiles, provides the best performance for the applications.

Those variants with a big TCPA tile (Arch3 and Arch5) result in higher latencies than the variants with smaller TCPA tiles.


Experimental Results (12/12)

Figure 11 shows the evaluation results for the considered heterogeneous architectures. In the figure, it can be seen that Arch2 and Arch4 dominate Arch1, Arch3, and Arch5 in costs and performance.

Figure 11: Evaluation of heterogeneous architectures against costs and performance.


Conclusion

We demonstrated the fast evaluation of the architectural design space of tile number, tile organization, and processor selection within each tile.