38
A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency Modelling Zhiliang Qian, Shanghai Jiao Tong University Paul Bogdan, University of Southern California Chi-Ying Tsui, The Hong Kong University of Science and Technology and Radu Marculescu, Carnegie Mellon University In this survey, we review several approaches for predicting performance of NoC-based multicore systems, starting from the traffic models to the complex NoC models for latency evaluation. We first review typical traffic models to represent the application workloads in NoC. Specifically, we review Markovian and non- Markovian (e.g., self-similar or long range memory dependent) traffic models and discuss their applications on multicore platform design. Then, we review the analytical techniques to predict NoC performance under given input traffic. We investigate analytical models for average as well as maximum delay evaluation. We also review the developments and design challenges of NoC simulators. One interesting research direction in NoC performance evaluation consists of combining simulation and analytical models in order to exploit their advantages together. Towards this end, we discuss several newly proposed approaches that use hardware- based or learning-based techniques. Finally, we summarize several open problems and our perspective to address these challenges. Categories and Subject Descriptors: A.1 [General and reference]: Introductory and survey; C.4 [Perfor- mance of systems]: Modeling techniques Additional Key Words and Phrases: Performance evaluation, Network-on-Chips (NoCs), analytical model, average and maximum delay, simulation, traffic models 1. INTRODUCTION With IC technology continuously shrinking down, state-of-the-art computing platforms widely use architectures such as Multi-Processor System-on-Chips (MPSoCs) and Chip Multi-Processors (CMPs); therefore, Network-on-Chips (NoCs) architectures are sug- gested in these designs as the future communication infrastructure to manage the in- formation transfer among the Processing Elements (PEs) [Benini and De Micheli 2002; Dally and Towles 2001]. When designing NoC-based multi-core platforms, the latency metric usually creates a big challenge during the design space exploration [Ogras et al. 2010; Kiasari et al. 2013a]. In order to make a proper design choice, an accurate and fast evaluation of each design candidate which has different configurations is needed [Bogdan and Marculescu 2009; Ogras et al. 2010]. More specifically, to evaluate the This work was supported in part by the Hong Kong Research Grants Council (RGC) under Grant GRF619813, in part by the HKUST Sponsorship Scheme for Targeted Strategic Partnerships. P.B. acknowl- edges the support by the US National Science Foundation (NSF) CAREER Award CPS-1453860 and Cy- berSEES CCF-1331610. Author’s addresses: Z.L.Qian, Micro- and Nano- Department, Shanghai Jiao Tong University, Shanghai, 200240, email: [email protected]; P. Bogdan, Ming Hsieh Department of Electrical Engineering, University of Southern California, 90089, email: [email protected]; C.Y. Tsui, Electronic and Computer Engineering De- partment, Hong Kong University of Science and Technology, email: [email protected]; R. Marculescu, Electrical and Computer Engineering Department, Carnegie Mellon University, 15213, email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is per- mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. c YYYY ACM 1084-4309/YYYY/01-ARTA $15.00 http://dx.doi.org/10.1145/0000000.0000000 ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Embed Size (px)

Citation preview

Page 1: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A

Performance Evaluation of NoC-based Multicore Systems: FromTraffic Analysis to NoC Latency Modelling

Zhiliang Qian, Shanghai Jiao Tong UniversityPaul Bogdan, University of Southern CaliforniaChi-Ying Tsui, The Hong Kong University of Science and Technologyand Radu Marculescu, Carnegie Mellon University

In this survey, we review several approaches for predicting performance of NoC-based multicore systems,starting from the traffic models to the complex NoC models for latency evaluation. We first review typicaltraffic models to represent the application workloads in NoC. Specifically, we review Markovian and non-Markovian (e.g., self-similar or long range memory dependent) traffic models and discuss their applicationson multicore platform design. Then, we review the analytical techniques to predict NoC performance undergiven input traffic. We investigate analytical models for average as well as maximum delay evaluation. Wealso review the developments and design challenges of NoC simulators. One interesting research direction inNoC performance evaluation consists of combining simulation and analytical models in order to exploit theiradvantages together. Towards this end, we discuss several newly proposed approaches that use hardware-based or learning-based techniques. Finally, we summarize several open problems and our perspective toaddress these challenges.

Categories and Subject Descriptors: A.1 [General and reference]: Introductory and survey; C.4 [Perfor-mance of systems]: Modeling techniques

Additional Key Words and Phrases: Performance evaluation, Network-on-Chips (NoCs), analytical model,average and maximum delay, simulation, traffic models

1. INTRODUCTIONWith IC technology continuously shrinking down, state-of-the-art computing platformswidely use architectures such as Multi-Processor System-on-Chips (MPSoCs) and ChipMulti-Processors (CMPs); therefore, Network-on-Chips (NoCs) architectures are sug-gested in these designs as the future communication infrastructure to manage the in-formation transfer among the Processing Elements (PEs) [Benini and De Micheli 2002;Dally and Towles 2001]. When designing NoC-based multi-core platforms, the latencymetric usually creates a big challenge during the design space exploration [Ogras et al.2010; Kiasari et al. 2013a]. In order to make a proper design choice, an accurate andfast evaluation of each design candidate which has different configurations is needed[Bogdan and Marculescu 2009; Ogras et al. 2010]. More specifically, to evaluate the

This work was supported in part by the Hong Kong Research Grants Council (RGC) under GrantGRF619813, in part by the HKUST Sponsorship Scheme for Targeted Strategic Partnerships. P.B. acknowl-edges the support by the US National Science Foundation (NSF) CAREER Award CPS-1453860 and Cy-berSEES CCF-1331610.Author’s addresses: Z.L.Qian, Micro- and Nano- Department, Shanghai Jiao Tong University, Shanghai,200240, email: [email protected]; P. Bogdan, Ming Hsieh Department of Electrical Engineering, Universityof Southern California, 90089, email: [email protected]; C.Y. Tsui, Electronic and Computer Engineering De-partment, Hong Kong University of Science and Technology, email: [email protected]; R. Marculescu, Electricaland Computer Engineering Department, Carnegie Mellon University, 15213, email: [email protected] to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for profit or commercial advantage and thatcopies show this notice on the first page or initial screen of a display along with the full citation. Copyrightsfor components of this work owned by others than ACM must be honored. Abstracting with credit is per-mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any componentof this work in other works requires prior specific permission and/or a fee. Permissions may be requestedfrom Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)869-0481, or [email protected]⃝ YYYY ACM 1084-4309/YYYY/01-ARTA $15.00DOI:http://dx.doi.org/10.1145/0000000.0000000

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 2: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:2 Z. Qian et al.

NoC architecture

Application traffic

Performance evaluation

Simulation/Prototyping

vld RLD

Inverse Scan

AC/DC

iQuant

Stripe memory

idct

Up sampARM

Vop

paddingVop

memory

70

362

362

36227

49

313

94

500313

300

353

357

16

a) b)

Feedback

vld RLD

Inverse Scan

AC/DC

iQuant

Stripe memory

idct

Up sampARM

Vop

paddingVop

memory

70

362

362

362

27

49

313

94

500

313

300

353

357

16

Memory

540

540

Memory126 126

R R R

R R R

R R R

PE0 PE1 PE2

PE3 PE4 PE5

PE6 PE7 PE8

Time

Packet rate

Inner loop synthesis

exploration

T1

T2 T3

Task mapping& Core placement

PE6PE2

T4

Routing path

Feedback

Fig. 1. A typical NoC-based MPSoC design flow for Dual Video Object Plane Decoder (DVOPD) application:a) application task graph for DVOPD [Pullini et al. 2007], where the numbers on the edge represent commu-nication data volume b) a typical optimization flow for design space exploration [Ogras et al. 2010; Kiasariet al. 2013a]

performance of each feasible candidate, the designer should first characterize the ap-plication to extract a specific traffic model [Varatkar and Marculescu 2002; Bogdanand Marculescu 2011]. The traffic model needs to abstract two key features, i.e., thesource and destination of every flow, as well as the packets inter-arrival time distribu-tion. For instance, a simple representation of an application uses the application taskgraph [Hu and Marculescu 2003; 2005] to capture these two features (Fig 1-a showsan example of the Dual Video Object Plane Decoder (DVOPD) application from [Pulliniet al. 2007; Bertozzi et al. 2005]). In this type of representation, a directed edge pointsfrom the source to the destination of each flow while the weight on the edge representsthe mean traffic data rate with the Poisson inter-arrival time distribution. After pre-characterizing the application, designers then need to schedule and allocate the taskson the proper PEs followed by the exploration of the PE core placement and routing al-gorithm design in the system. For each feasible design configuration, the performanceanalysis step needs to be performed to evaluate the quality of the design [Ogras et al.2010](shown in Fig. 1-b). Towards this end, both simulation-based and analytical mod-els are used in the NoC design flow.

In general, NoC simulators attempt to model the architectural implementation de-tails of NoC routers (i.e., both the data and control paths) and can provide estimationswith very high accuracy. That is necessary to have detailed performance evaluationbefore prototyping. One limitation is that it usually takes a lot of time to simulate asystem with large size. Moreover, it is not easy to figure out the performance bottle-necks (e.g., the choice of buffer size of a specific router) from the simulation results;which makes the simulations less powerful and efficient in design space explorations[Ogras et al. 2010; Kiasari et al. 2013a]. Because of these inefficiencies, analytical NoCmodels are also used in the early design stage to support fast exploration of a large de-sign space. Compared to simulations, the accuracy in analytical models is compromisedfor the flexibility and speed enhancement [Ogras et al. 2010; Kiasari et al. 2013a].

In this survey, we review and summarize both the analytical and simulation ap-proaches that have been used in evaluating NoC performance. We first notice that aspecific NoC performance highly depends on the traffic/workload applied onto the sys-tem [Bogdan and Marculescu 2011]. Therefore, in section 2, we first review types ofworkload which are widely adopted in the literature. Then, we review the mathemat-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 3: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:3

Trace-based

Execution-based

Synthetic pattern

Application-driven pattern

Memoryless workload

Injection Method

Traffic pattern

Application-driven workload

Fractal/Self-similar workload

Speed

AccuracyPoisson

Fractal

Processor core (PE)

High

High

NoC system

PE

R

PER

PE

R

PE

R

PER

PE

R

PE

R

PER

PE

R

PE

R

PER

PE

R

PE

R

PER

PE

R

Data and control path

Network interface

Memory

Data path

Control path

CrossbarRouter

Packets

To/From Router

Stat

isti

cal

mod

elFig. 2. NoC architectures overview [Dally and Towles 2003] and workload classification based on the trafficpattern (spatial characteristics) and injection method (temporal characteristics)

ical approaches that capture the non-Poisson traffic behavior of the application. Afterobtaining a traffic model to characterize the input information, the next step is to usean NoC model to predict the performance. In section 3, we review the analytical NoCmodels. Both the average and maximum latency models are discussed in this section.In section 4, we summarize the simulation-based techniques for NoCs with large sizes.Then, in section 5, we discuss the research direction which combines the simulation-based and analytical methods to exploit the advantages from both models. Specifically,we review two types of approaches that use learning and hardware based modellingtechniques to speedup the performance evaluation and maintain the prediction accu-racy. In section 6, we summarize the open problems in NoC performance evaluationsand present our perspective. Finally, we summarize and conclude this survey in section7.

2. NOC TRAFFIC ANALYSISIn this section, we first categorize the NoC workload models used in the perfor-mance evaluation. Then we summarize the methods to observe and characterize fractaland non-stationary behaviors. In particular, several representative techniques are re-viewed, including the Hurst-parameter based models [Varatkar and Marculescu 2002],the phase type based models [Kuhn 2013], non-equilibrium statistical physics inspiredmodels [Bogdan and Marculescu 2011] and multifractal models [Bogdan 2015].

2.1. NoC workload categorizationIn general, there are two important aspects when describing an NoC workload [Dallyand Towles 2003], i.e., the traffic patterns and the injection inter-arrival time (IAT) dis-tributions [Bogdan and Marculescu 2011]. The former describes the distribution of thesource and destination nodes of every flow in the topology. The latter dictates when toissue (inject) a packet from the source Processing Element (PE) into the network or thepackets inter-arrival processes at the intermediate router buffer channels, so it repre-sents the temporal characteristics of the traffic [Soteriou et al. 2006]. In Fig. 2, wecategorize various NoC workloads based on these two features (i.e., spatial patternsand temporal injection processes). The terminology in the figure follows the conven-tions and descriptions in [Dally and Towles 2003]. Specifically, for the traffic patterns,

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 4: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:4 Z. Qian et al.

we classify them into synthetic and application-driven patterns. The synthetic trafficpatterns use mathematical methods to create the source and sink PE addresses; whileapplication-driven patterns determine the destination of each flow from the mappingresults of the applications running on target CMP/MPSoC platforms. For the injectionprocesses, mathematical models (e.g., Poisson or Fractal models) can be used to dic-tate the packets IAT distributions. Moreover, the packets IAT can also be extractedfrom real applications. For example, designers can use trace logs or fully execute theapplication to determine when to issue the next packet in a flow.

To evaluate an NoC design, the most accurate workloads should describe both thetraffic pattern and injection method from real applications. In the table of Fig. 2,the crossed entry of realistic pattern (column) and injection method (row) is namedas application-driven workload. It can be further classified into execution-driven andtrace-driven workloads. For the execution-driven workload, the traffic is produced bymodelling the PEs together with the NoC. Specifically, the processor executes the wholeapplication program and decides when to issue a packet at run time. Because such in-jection method requires a detailed model of both the processors and NoC, this type ofevaluation is also named as full-system simulation. In summary, the full-system sim-ulation achieves the highest accuracy; however it takes a very long evaluation time.Moreover, this methodology is not very flexible because if a parameter (e.g., number ofvirtual channels or buffer size) changes, the whole evaluation needs to be performedagain; this introduces additional timing overhead to explore the design space [Ograset al. 2010]. To improve the speed, the trace-based workload has also been used. Intrace-based workload, after performing the full-system simulation once, the traffictraces (e.g., a file recording the exact time/cycle when a PE injects a cache-coherent orrequest/response packet) are logged. Then, during subsequent simulations, the pack-ets are issued into the network following the trace logs. The execution models of theprocessors are no longer required.

One limitation of using the application-driven workloads for MPSoC platforms is theusers need to provide detailed application information at the very beginning. While forgeneral purpose CMP platforms, there already exists a number of benchmarks, such asParsec [PARSEC 2009], Splash-2 [SPLASH-2 1995] and Spec [Spradling 2007] suite, toprovide representative applications in different domains (e.g., high performance com-puting, image/video processing). However, it is difficult to extrapolate the performancefrom these benchmarks to a new application class which may have very different char-acteristics (e.g., different levels of burstiness) [Soteriou et al. 2006; Gratz and Keckler2010]. Another limitation of trace-based workloads is that the dependency among thepackets may be changed after applying them on a new design [Hestness and Keck-ler 2010]. Therefore, researchers have found it is better to pre-analyze the collectedtraces before applying them in the new evaluations. The Netrace tool tries to addressthis dependency problem and improve the evaluation accuracy [Hestness and Keck-ler 2010]. This is achieved by performing a dependency analysis on the traces. Theidentified dependency relationship among packets is kept in the subsequent network-only simulation by preserving the order of packets that is injected in NoC. Similarly,in [Mahadevan et al. 2005], the dependency among the packets is kept by making anew transaction happen only after the PE receiving all responses from its dependentprocessors.

As shown in Fig. 2, in addition to application-driven workloads, mathematical (orstatistical) models are also widely used to describe the traffic patterns or packet inter-arrival times. For the traffic patterns, they can either be abstracted from applicationsand represented in a graph or be synthetically created without using any applicationinformation. For the packet inter-arrival times, they can be described using differentmodels (e.g., Poisson or self-similar models). One advantage of mathematical injection

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 5: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:5

methods is that by controlling some parameters (e.g., the Hurst parameter) in themodel, a variety of traffic inputs with user-desired property (e.g., self-similarity) canbe created. In the following, we review these traffic models. In particular, we elaborateon the memory dependent models (e.g., fractal or self-similar) in NoC traffic analysis.

2.2. Traffic pattern characterizationTo describe the distributions of the traffic sources and destinations, the applicationtask graphs and synthetic traffic patterns are two widely used methods [Marculescuand Bogdan 2009]. In Fig. 2, they belong to the application-driven and synthetic pat-tern, respectively. Specifically, the task graphs are structures used to represent someprofiled applications. In a task graph G = (C,A), the vertex set C represents all compu-tation tasks and each directed edge ai,j in A represents the communication flow fromvertex ci to cj . The weights on the edge characterize the communication data volume(bits) between tasks [Hu and Marculescu 2003]. After mapping tasks onto proper PEs,the architecture characterization graph (ARCG) ARCG = (T, P ) further describes thecommunications among the PEs in NoC [Hu and Marculescu 2003]. There are sev-eral popular task-graph based benchmarks characterizing different multimedia andcommunication applications in NoC, including the MMS (Multimedia system) [Hu andMarculescu 2004b], PIP (Picture in picture), MWD (Multi-window detection), MPEG(MPEG decoder) [Bertozzi et al. 2005], DVOPD (Dual Video Object Plane Decoder)[Pullini et al. 2007], H264D (H264 decoder) [Liu et al. 2011] and LDPC (Low densityparity check encoder/decoder) [Liu et al. 2011; Wang et al. 2014] applications.

In addition to using task graphs, synthetic traffic generators, which operate on thesource and target PE addresses to produce a variety of artificial patterns, are alsoused. For example, in random traffic pattern, every PE in the system can choose a des-tination with same probability. This in general creates uniform traffic loads across dif-ferent channels in the system. For uneven traffic, permutations on the node addressesare used [Gratz and Keckler 2010]. For instance, to obtain the destination address, thepermutation can be performed by exchanging the X- and Y- coordinates of the sourceaddress (i.e., transpose traffic); it can also be performed in bit-level, such as revers-ing order of bits (i.e., bit-reversal), or complementing each bit (i.e., bit-complement)[Dally and Towles 2003]. Compared to task graphs and application-driven workloads,synthetic traffic patterns provide more artificial traffic scenarios to evaluate a design[Gratz and Keckler 2010].

As shown in Fig. 2, for both task-graph based and synthetic-based patterns, theirinjection processes can be memoryless or memory-dependent. To describe the packetinjection process, let a random variable x represent the packet inter-arrival time (IAT)of the target flow. The simplest model to describe the distribution of x is based on Pois-son injection process, which means x follows a negative exponential distribution, i.e.,Px(t) = λe−λt, where Px(t) is the probability density function and λ is the mean packetrate (packets/cycle). In general, Poisson traffic models belong to the type of memory-less workload, which means the probability distribution does not depend on previousstates. One limitation is that it cannot reflect the burstiness and packet dependencies[Wu et al. 2010; Kouvatsos et al. 2005]. Therefore, statistical techniques which consid-ers non-Poisson IAT distributions are needed. Specifically, these models are built uponfractal/self-similar traffic models to capture the short-range and long-range memorydependencies among the IATs. In the following section, we survey these techniques.

2.3. Techniques to model and analyze mono- and multi-fractal trafficIn this section, we summarize the techniques to model fractal/self-similar NoC traffic.To unify the symbols of different approaches, based on the parameter conventions, thenotations in Table I are used in the presentation of this section.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 6: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:6 Z. Qian et al.

Table I. Symbols and notations used in the traffic analysis

Notation Definitionx A random variable (e.g., represents the packet inter-arrival time (IAT))

Px(t) Probability density function (PDF) of x, i.e., probability density of x = tFx(T ) Cumulative distribution function (CDF) of x, i.e., probability of x ≤ TX A random process which is either continuous time (with index variable t) or discrete time

(with index variable n)1X(n) The nth indexed random variable in the random process XXm A discrete time random process, whose element is the random variable averaged over m

consecutive data in XXm(n) The nth indexed random variable in process Xm

E(X) Expectation (mean) of the random process Xσ2(X) Variance of the random process X; σ(X) is the standard derivation

σ2(Xm)/V ar(m) Variance of the random process Xm, which is a function of parameter mrX(l) Autocorrelation function of random process X under the lag size l

fX(w)/fX(w,H) Spectral density function of X; w is the frequency parameter; H is hurst parametern(t) A random variable represents the number of packets generated during interval [0, t)Ni(t) Average number of router channels in NoC that have i packets at time t

1In section 2.3.1, we assume X is second-order stationary (wide-sense stationary). Second-order stationary meansthe first two moments of X (mean and co-variance) does not depend on specific index variable t (for continuoustime) or n (for discrete time).

2.3.1. Review of LRD/fractal/self-similar traffic . Self-similarity, fractal and long-range de-pendence (LRD) are concepts that have been developed from observing the naturaldata sequences with memory dependencies between two different time instances orindexes [Bogdan et al. 2010; Yoshihara et al. 2001; Ryu and Lowen 2000]. There areseveral ways to formally characterize self-similar/fractal/LRD traffic flows. For exam-ple, suppose a set of random variables is used to represent the packet arrivals from thesame flow over time (i.e., a sequence of random variable x denoting the packet inter-arrival times). This random variable sequence forms a discrete time random processX. The random variable X(n) with an index n represents the IAT between the nth and(n+ 1)th packet. When analyzing such random process X, we usually start from wide-sense stationary (or namely second-order stationary) assumption [Yoshihara et al.2001], which means the first (i.e., mean) and second moment (i.e., co-variance) of Xdoes not depend on specific index choices of the random variables. For example, for adiscrete time random process X, by definition, the mean of X should be a function ofn where the nth element corresponds to the expected value of X(n). Under the wide-sense stationary condition, the mean at different indexes are the same. Therefore itcan be represented by a single parameter E(X).

Based on above assumption, the definition "long-range-dependency" comes from ob-serving the correlations between two random variables in X, whose indexes are l lagsaway. Formally, the autocorrelation function rX(l) of X is defined as [Varatkar andMarculescu 2004; Park and Willinger 2000]:

rX(l) =E[(X(n)− E(X))(X(n+ l)− E(X))]

σ2(X)(1)

where X(n) and X(n + l) are two random variables at index n and n + l, respectively.Under second-order stationary condition, the auto-correlation rX(l) only depends on land does not rely on the choices of instance n.

Using Eqn. 1, rX(l) can be computed with respect to different choices of value l.Intuitively, rX(l) reveals how current data is affected by its neibhors that are l lagsaway. For short-range dependency (SRD) series, rX(l) decreases exponentially with l;while for long-range dependent (LRD) random process, rX(l) decreases much slowly.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 7: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:7

Theoretical derivations have shown a close to power law decreasing with respect tothe leg length l [Varatkar and Marculescu 2004; Park and Willinger 2000].

Besides examining the autocorrelation function rX(l), the dependency among thedata samples can also be presented by the concepts of "self-similarity" or "fractal".These two definitions come from observing the random process Xm under differentscale levels [Varatkar and Marculescu 2004]. More precisely, in these definitions, foreach scale level m, where m is an integer value chosen by the observer, a new pro-cess Xm can be produced by making each indexed random variable Xm(n) equal to:Xm(n) =

∑mi=1 X((n − 1)m + i)/m. The variance of Xm is then calculated and is a

function of the scale-level m. In Table I, this variance σ2(Xm) is also represented asV ar(m). Of note, when calculating V ar(m), for independent or SRD sequence X, bytaking the average of every m samples in X, it is very likely the data is smoothed.Consequently the data variance of Xm should decrease very fast. Typically, V ar(m)decays with m exponentially. On the other hand, for self-similar X, the increasing ofscale level m does not significantly reduce the variance of the new sequence Xm. Thisis reflected in V ar(m), which decreases much slowly as: V ar(m) ∼ m−β , where β typi-cally locates in the range (0, 1) [Varatkar and Marculescu 2002].

The third way to describe a LRD process X is to transfer and observe the series inthe frequency domain. Formally, let f(w) represent the spectral density of the randomprocess X; for fractal/self-similar X, it has been shown f(w) ∼ bw−γ when w → 0[Varatkar and Marculescu 2004].

Based on above discussions, two methods have been widely used to examine whethera time series is long-range dependent or self-similar 1. The detail derivations are pro-vided in [Varatkar and Marculescu 2002; 2004; Min and Ould-Khaoua 2004]. Here, wejust highlight the ideas in those works as follows: The first method is the variance-timeanalysis which plots the curve of log(V ar(m)) against log(m). For LRD/self-similar pro-cess X, it has been shown log(V ar(m)) decreases linearly with log(m) and the slope −βsatisfies 0 < β < 1. The second method is based on Hurst effect. Specifically, it cal-culates the "rescaled adjusted range statistics" (i.e., R/S statistics [Qian and Rasheed2004]) of the random process X. Then, the Hurst parameter H is calculated as thechanging rate of the R/S statistics with respect to the data sequence size n. By exam-ining the value of H, the dependencies among the time series can be described quan-titatively. For memoryless time series, H = 0.5; while for LRD sequences, 0.5 < H < 1.Moreover, it is observed the Hurst parameter H in R/S method equals to 1−β/2 in thevariance-time method.

2.3.2. Generative traffic models for self-similar traffic. The variance time and R/S statisticmethods are useful in analyzing time sequences. However, they can not be used togenerate self-similar traffic. To produce fractal traffic, two kinds of models are widelyused. The first type utilizes only one parameter (e.g., the Hurst parameter H) to rep-resent the dependency and self-similarity (e.g., [Varatkar and Marculescu 2004] and[Soteriou et al. 2006]). While the second type is based on "Phase method" which usesdifferent phases to describe the packets generation process [Kuhn 2013]. Examples arethe Generalized-exponential (GP) process [Wu et al. 2010], Markov-modulated Pois-son process (MMPP) [Kiasari et al. 2013b; Fischer and Meier-Hellstern 1993] and amore generalized Markov arrival process (MAP) [Diamond and Alfa 2000; Klemm et al.2002]. In the following, we summarize the principles of these two models, respectively.

1For non-stationary random processes, the "detrended fluctuation analysis" (DFA) [Kantelhardt et al. 2002]should be applied, which uses a regression function to pre-fit the temporal trend over the data. The readerscan find more details and a step-by-step tutorial of implementing DFA in matlab in [Ihlen 2012]

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 8: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:8 Z. Qian et al.

f(w,H)=FGNspecified power

spectrum

Discrete g(H)spectrum

Expand freqency spectrum to length n

Inverse DTFT

TransformationFX

-1(FX*(X*(n)))

Desired inter-arrival times FX(t)

n = Trace length

1

Approximating f(w,H)

Power spectrum

g(w,H)

Frequency domainH = Hurst parameter

2

Time domain

3

4

Sampling

Time series X*

Fig. 3. Procedures of FGN-based synthetic traffic generation [Varatkar and Marculescu 2004]

1) Hurst-parameter-based modeling: In [Varatkar and Marculescu 2002; 2004], theauthors for the first time propose to consider the dependencies in NoC traffic by usingthe Hurst parameter H. To produce a synthetic packet sequence with a user-neededself-similarity level H (0.5 < H < 1), Fractional Gaussian Noise (FGN) model [J.Beran1994] is used. Their procedures of applying FGN model to generate synthetic NoC traf-fic traces is shown in Fig. 3. We summarized their methods as follows: the inputs to thegenerative model include the user-specified H value, the length of the total sequence nand the desired packet inter-arrival time (IAT) distribution Fx(t). The first step of theoverall procedure is to produce a data series that has a self-similarity level H; this isdone via sampling the FGN spectrum f(w,H) (w is the frequency component), whichis given by [Paxson 1997]:

f(w,H) = A(w,H)× [|w|−2H−1 +B(w,H)](H ∈ (0, 1);w ∈ (−π, π)) (2)In Eqn. 2, A(w,H) and B(w,H) are specific frequency functions whose closed-form

are derived in [Paxson 1997]. In order to obtain f(w,H), it requires to compute a sum-mation of infinite terms due to the existence of function B(w,H). Following the stepsin [Paxson 1997], a simple approximation of f(w,H) with finite terms of summationis obtained and is denoted as g(w,H). The continuous power spectrum g(w,H) is thenused as the input of the subsequent sampling procedure (i.e., Step 2 in Fig. 3). Aftersampling, a Discrete Fourier Transform (DFT) spectrum with n/2 points is obtained.Based on the symmetric property of the spectrum, the DFT components can be mir-rored and extended to the size of n. Transforming these n points in the frequencydomain back via "inverse Discrete Time Fourier Transform" (IDTFT) creates a n-pointtime series X∗ (Step 4). This time series has a desired Hurst parameter H. However,it is noticed in the previous sampling step, since the FGN spectrum is used, the out-put data sequence actually has a Gaussian distribution shape whose mean equals tozero. Consequently, we need to match the distribution of X∗ with the desired cumula-tive distribution function Fx(t). To achieve this, the following transformation is usedto create the nth data X(n) from X∗(n) [Varatkar and Marculescu 2004]:

X(n) = F−1x (Fx∗(x∗(n))) (3)

where Fx∗(t) is the cumulative distribution function (CDF) of the sequence X∗ afterstep 4. X is the created new time series with X(n) denoting the nth packet IAT. Thegenerated traffic trace can then be fed into NoC simulators to evaluate the latencyperformance under the impacts of fractal traffic inputs.

2) Phase-type random process based traffic models: Phase-type models belong tothe type of generative methods which are widely adopted in queuing analysis to re-produce the customer arrival or service process using multiple stages (phases) [Klein-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 9: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:9

rock 1975]. More precisely, each phase is a simpler random process (e.g., Markovianprocess) dictating the specific time interval distribution in current stage. Every time,after experiencing this time interval, the process may switch to a next stage (phase)with a certain probability or directly enter into the ending (absorption) state whichmeans the generative process (e.g., the inter-arrival time or the service time) finishesfor the current customer [Kuhn 2013]. In the following, we review three such models.

In [Kouvatsos et al. 2005; Wu et al. 2010], the Generalized exponential (GE) trafficmodel is used in the queuing network analysis. Specifically, a direct packet generationpath with zero inter-arrival time is added to the Poisson IAT model. In this way, thespecific time intervals of packets can come from the normal memoryless model withprobability τ or alternatively from the direct path with probability 1 − τ [Wu et al.2010]. By controlling parameter τ (i.e., controlling the number of packets being issuedin burst with zero intervals), different level of arrival burstiness can be modelled.

In [Kiasari et al. 2013b; Yoshihara et al. 2001], the traffic burstiness is proposed tobe modelled as an Markovian Modulated Poisson Process (MMPP). In these works, anm−state MMPP is made up of m separate Markovian phases. Each stage i describesa negative exponential IAT distribution with a mean value 1/λi (λi is the averagepacket generation rate in state i). In order to represent the transition rates (or namelytransition probabilities) between any two states, a matrix Q is used. Specifically, thematrix entry of row i and column j in Q (i.e., qij) corresponds to the rate (or probability)that changes from phase i to j [Kleinrock 1975].

In MMPP traffic models, a function n(t) which represents the total number of pack-ets generated during time interval (0, t] is used to determine λi of each state and qijin matrix Q [Yoshihara et al. 2001]. Specifically, to fit the traffic traces using a 2-stateMMPP-based traffic model, the mean E(n(t)), the variance σ2(n(t)) as well as the In-dex of Dispersion Count (IDC) IDC(n(t)) of the random function n(t) are measuredfor the application traffic first; then, the fitting procedure can be performed by match-ing the measured statistics with the closed-form formula of these metrics derived in astochastic 2-state MMPP model [Shahram and Tho 2000; Yoshihara et al. 2001].

The 2−state MMPP traffic model has been widely used due to its simplicity, althoughit sometimes still introduces certain errors in fitting some applications [Shahram andTho 1998]. To better capture the traffic self-similarity, multiple Markovian states-based modelling techniques were developed in which the traffic is modelled by themixture of several interrupted Poisson processes (IPPs) [Yoshihara et al. 2001; Minand Ould-Khaoua 2004].

In [Yoshihara et al. 2001], a multiple state MMPP traffic model was developed whichconsists of d (d > 1) IPPs. To be more specific, the d-state MMPP traffic model is ob-tained by aggregating several 2-state MMPP processes described previously. Comparedto the case of a 2-state MMPP, there are more parameters need to be derived in thed-state model. The traffic trace is pre-processed under d different time scales. For eachscale level m (e.g., m = 1, 2, .., d), a new data series is generated by averaging the orig-inal sequence with a time window size m. Then, the parameter fitting procedure isperformed on the obtained d sequences separately.

Besides MMPP models, a more generalized Markov Arrival Process (MAP) has alsobe used in traffic analysis [Klemm et al. 2002]. In summary, the fitting algorithm ofsuch traffic model is still based on matching the statistic moments which are obtainedanalytically with the measurements in given traffic traces. For example, in [Casaleet al. 2008], a MAP traffic model is obtained by matching the first three moments ofIAT as well as the correlation between X(n) and X(n+ 1) (i.e., E[X(n)×X(n+ 1)]) inthe analytical model with the real traffic measurements.

Compared to the Hurst-parameter-based traffic modeling, phase-type based trafficmodels are compatible with many existing queuing models (such as the MMPP/G/1

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 10: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:10 Z. Qian et al.

Fig. 4. An example illustrating the non-stationary traffic characteristics in a 4×4 mesh NoC (From [Bogdanand Marculescu 2010; 2011]): a) for a channel port, the packet header flit inter-arrival time distribution isshown; the IAT is compared with Poisson process b) the plot of third order moment versus two time lagsderived from the FGN model c) the actual third order moment calculated from the application traffic

and MAP/G/1 queues [Min and Ould-Khaoua 2004]). They can be used in a perfor-mance model to evaluate NoC latency under specific traffic characteristics. However,for highly LRD traffic models, it is usually very difficult to obtain a closed-form solu-tion in queuing theory. Moreover, most phase-type traffic models are based on station-ary or wide sense stationary IAT assumption. Therefore, they do not well model thenon-stationary inter-arrival processes. In the next sections, we survey the techniquesproposed for multi-fractal NoC traffic analysis.

2.3.3. Motivations for multi-fractal traffic modeling. Many existing NoC traffic models as-sume that the random process X is wide-sense stationary (WSS). Consequently, thetraffic models built upon techniques such as FGN (shown in Fig. 3) can be classifiedas mono-fractal analysis, where the traffic correlations between time instance t andt + δt can be represented using a single and unified parameter H [Bogdan and Mar-culescu 2010]: |X(t + δt) − X(t)|q ∼ (δt)qH . Recently, some profiling results of realNoC applications show a different non-stationary behavior due to the dependenciesamong different PEs as well as the request-response correlations of the packets [Bog-dan and Marculescu 2011; 2010]. As a result, many traffic traces are non-stationaryand can not be fully characterized by a single Hurst exponent. Instead, they arebetter characterized by a set of scaling exponents [Bogdan and Marculescu 2010]:|X(t + δt) − X(t)|q ∼ (δt)g(q), where q is the moment index described in [Bogdan andMarculescu 2011; 2010] and g(q) is a function of q which is non-linear in general.

Based on the above observations, one interesting and on-going research directionon NoC traffic modeling is to extend beyond the existing single exponent (i.e., mono-fractal) traffic model to a more accurate multi-fractal model.

In Fig. 4-a, an example is shown which compares the Poisson, mono- and multi-fractal traffic models [Bogdan and Marculescu 2010; 2011]. In this example, a sequenceof packet inter-arrival times (IAT) is used for analysis 2. From Fig. 4, it shows thatwhen observing a specific router in NoC, the IAT may have a long tail form while con-ventional Poisson model cannot capture this kind of long-range dependency. To capturethe burstiness, a Fractal Gaussian Noise (FGN) model described in Section 2.3.1 canbe employed. That model is usually accurate in matching the first two moments ofthe input traffic trace. However, [Bogdan and Marculescu 2010] further compared the

2The IAT distribution is plotted for the west input port of a node in a 4 × 4 mesh NoC under the executionof a multi-threaded NoC application [Bogdan and Marculescu 2010; 2011]

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 11: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:11

Fig. 5. The multifractal spectrum showing the multifractalness of realistic NoC applications (From [Bogdan2015]): a) the multifractal spectrum of 3 cores in Intel SCC platform executing SPEC applications b) themultifractal spectrum of 7 cores in an 8× 8 NoC executing a PARSEC application.

third order cumulant C3 which is defined as:

C3 = X(t1)X(t2)X(t3)−X(t1)×X(t2)X(t3)−X(t2)×X(t1)X(t3)

−X(t3)×X(t1)X(t2) + 2X(t1)×X(t2)×X(t3)(4)

where t1, t2 and t3 are three instances of the overall data sequence Xt. For an IATdistribution fitted by a Gaussian random process, the plot of C3 with respect to twointervals t2 − t1 and t3 − t1 (i.e., time Lag 1 and Lag 2 in the figure) is shown in Fig.4-b. In Fig. 4-b, C3 is zero across different combinations of t2 − t1 and t3 − t1 in FGNmodel. However, the actual third order moment of the IAT is also plotted in Fig. 4-c.It can be observed C3 of the actual trace collected from simulations has several peaksand is non-uniform in general, which are not captured by the mono-fractal Gaussianmodel. Because of this, it motivates to develop a new multi-fractal analysis frameworkfor non-stationary traffic modelling.

2.3.4. Multifractal spectrum of real applications. One useful method to identify a multifrac-tal data series from mono-fractal sequence is to build the multifractal spectrum (ornamely singularity spectrum), which plots the singularity dimension D(h) against theholder exponent h [Lopes and Betrouni 2009]. Intuitively, the holder exponent h dic-tates an additional (t−ti)

h term when using expansions to approximate the data seriesaround point ti; while D(h) is the fractal dimension of the data set that is made up ofthe points having the same h [Physionet 2004]. For mono- or memoryless series, themultifractal spectrum has a narrow width. For multifractal sequences, the multifrac-tal spectrum spans a variety of h values [Ihlen 2012]. In [Bogdan 2015], the authorplotted the multifractal spectrum for two realistic NoC applications (shown in Fig. 5).Specifically, Fig. 5-a is the multifractal spectrum observed from three cores in IntelSingle-Chip Cloud Computer (SCC) platform executing a SPEC MPI application. Fig.5-b is the spectrum for a 8 × 8 mesh NoC running a PARSEC application. Similarobservations have been reported for different PARSEC applications [Bogdan and Xue2015]. From these spectra, we can conclude many realistic NoC applications indeedhave multifractal characteristics. A set of exponents can better characterize the trafficbehaviors instead of a single Hurst parameter.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 12: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:12 Z. Qian et al.

2.3.5. Modeling the system dynamics using mean-field approach. To address the non-stationary property, a mean field (MF) traffic modelling technique is used in [Bogdanand Marculescu 2011; 2010]. In summary of their approach, the packet transmissionin NoC is modelled using a random graph (RG). In a RG, the nodes represent the bufferchannels in the router and the edges represent the application packets that are trans-ferred between the RG nodes. At run time, the IN and OUT degree of each node in theRG correspond to the number of incoming and outgoing customers (packets), respec-tively. Of note, the randomness of the graph refers to the connections in the graph willchange dynamically depending on the arrival and departure of packets at the channel[Bogdan and Marculescu 2011]. Based on these concepts, the authors derive that theIN degree of nodes in RG satisfies the following equation [Bogdan and Marculescu2010]:

∂Ni

∂t=

pηi(t)

Mi(t)f1(Ni−1, Ni)−

rθi(t)

Zi(t)f2(Ni, Ni+1) (5)

where Ni(t) represents the average number of buffer channels at time t whose INdegree equals to i (i.e., having i arrival packets). In Eqn. 5, the first term considersthe effects of packets arriving at a buffer channel. Specifically, by the definition of theRG, an edge is connected to a node if a packet reaches the corresponding buffer node.Therefore, at time t, if a packet arrives at any one of the Ni−1 nodes which have anIN degree i − 1, then Ni should increase by one at t + δt 3. On the other hand, if apacket arrives at any one of Ni nodes that already have i packets, Ni will decreaseby one and Ni+1 is augmented by one as a result. Thus, the function f1(Ni−1, Ni) inEqn. 5 models the packet arrival effects by first adding the number of new nodes inRG that enter into Ni state from Ni−1 and then subtracting the number of nodes thatleave from Ni to Ni+1. In Eqn. 5, pηi(t)/Mi(t) is a specific time-dependent functionwhich denotes the probability of a new packet arriving at one node in RG at timeinstance t. Similar to the first part of Eqn. 5, the second term in Eqn. 5 considersthe effects of packets departing from the current router channel. rθi(t)/Zi(t) in Eqn. 5represents a time-dependent probability that a packet will leave the current node attime instance t. The function f2(Ni, Ni+1) considers two cases: if a packet leaves nodewith IN degree Ni+1, then Ni increases; on contrary, if a packet leaves node with INdegree Ni, then Ni decreases. In pηi(t)/Mi(t) and rθi(t)/Zi(t) of Eqn. 5, the factors p andr represent the probabilities of a packet arriving at or departing from a buffer channelin RG. They can be pre-characterized from the obtained packet injection traces and therouting algorithms; Mi(t) and Zi(t) are two normalization parameters for pηi(t) andrθi(t) which are given in [Bogdan and Marculescu 2011; 2010]. Finally, the two time-dependent fitness functions ηi(t) and θi(t) play the key roles on fitting the final systemtraffic. In [Bogdan and Marculescu 2009], these two fitness functions are developed inanalogy to the energy fitness functions in statistical physics.

Based on Eqn. 5, given the pre-assumed functions η and θ, the time-dependent prob-ability P (i, t|η, θ) which describes the possibility of finding a node in RG that has anIN degree i at time instance t, can be calculated as [Bogdan and Marculescu 2010]:

∂[tβP (i, t|η, θ)]∂t

= [η

M+

θ

Z]∂2[iP (i, t|η, θ)]

∂i2+ [

θ

Z− η

M]∂[iP (i, t|η, θ)]

∂i(6)

For β = 0, Eqn. 6 is reduced to the form of a conventional Poisson traffic model. Onthe other hand, when β ̸= 0, Eqn. 6 is able to represent a statistical process whichdisplays multifractal characteristics.

3Assume δt is the infinitesimal time interval

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 13: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:13

Memoryless or Short-range dependent

rx(l)

l

• Poisson traffic (random/ transpose pattern with Poisson injection etc .)

• Batch injection traffic

Long range dependent

rx(l)

l

Monofractal

h

D(h) • Markovian Modulated Poisson Process (MMPP)

• Constant Hurst parameter model

Multifractal

h

D(h) • Realistic benchmarks (PARSEC, SPEC etc.)

• Variable Hurst parameter model

App

licat

ion

Fig. 6. Summary of NoC traffic models. The short-range and long-range dependencies are identified by theautocorrelation function rX(l) defined in Table I. The mono- and multi-fractal traffic differ in the singularityspectrum D(h), which is a function of singularity exponent h [Ihlen 2012].

By solving the equations as Eqns 5 and 6, users can predict the system dynamicswhich evolve with time. This consequently enables a more accurate run-time controlover the system. For example, in [Bogdan and Xue 2015] and [Bogdan 2015], the au-thors demonstrate a framework using model predictive control (MPC) for run-timepower management. The multi-fractal workload model and system dynamic equationsare developed first. Then, based on the predictions on the system dynamics, the con-troller aims to adjust the voltage/frequency of each tile at run time to reduce the powerconsumption. The authors compared the multifractal control with Poisson and mono-fractal approaches. It is observed the multifractal control framework accurately cap-tures the system dynamics and therefore significantly saves system energy.

2.3.6. Summary of the traffic analysis techniques. In Fig. 6, we summarize different NoCtraffic models discussed in previous sub-sections. In general, we can first compute theautocorrelation function rX(l) of a time sequence. For memoryless or short range de-pendent traffic, rX(l) rapidly reduces as l increases. On the other hand, for long rangedependent traffic, rX(l) has a long tail. The synthetic workload with Poisson injec-tion process is an example of memoryless traffic. For such traffic, conventional M/M/1queuing model or diffusion approximation approaches [Kobayashi 1974] can be usedin the performance analysis. For the long-range dependent (LRD) traffic model, weneed to further consider the mono- and multi-fractal property by comparing the multi-fractal spectrum D(h). In summary, the Hurst-parameter-based and phase-type-basedmodels discussed in Section 2 use a single exponent to characterize the workload. Theybelong to the mono-fractal model. On the other hand, many real applications such asthose in PARSEC and SPEC benchmarks have a more complex multi-fractal behavioras shown in Fig. 5. One limitation of current multi-fractal traffic model is that mostanalytical performance models cannot directly take it as input. Therefore, a new per-formance analysis framework supporting multi-fractal traffic and can be embedded inthe design space exploration is required.

3. ANALYTICAL MODELS FOR NOCIn this section, we first investigate the queuing-theory-based models for mean end-to-end delay prediction in NoC. Then, we review the approaches to derive the maximumdelay bounds. For a clear presentation, the parameters used in the models and the cor-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 14: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:14 Z. Qian et al.

Table II. Parameters in the analytical performance models [Qian et al. 2015; Ben-Itzhak et al. 2011]

Notation Description and DefinitionLs,d Delay of a flow from the Processing Element (PE) s to PE dv/vs Waiting times at the injection buffer of PE sηs,d Total time spent to transfer a flit from PE s to PE dhs,d Accumulated contention delays for the flow which is from source s to sink d

lfi The ith channel that flow f traverses sequentially during the routingλf The aggregated flow arrival rate (packets/cycle)Fl The set that contains all the flows routing over the link l

d(f, l, i) The ith downstream channel which f traverses starting from the link ls/sl Average service time of all packets traversing the channel lsflitl Average flit service time in the buffer of channel lR2

l Second moment (SCV) of the packet service time for link l

ulfi

Maximum duration of packets in flow f leaves link lfi

a)Crossbar

RCSwitch

Allocator

North

East

Localb)

North input channel

flow to south

flow to east

flow to local

Weighted average of service time

Equivalent Server

Credit

North input channel

Fig. 7. Extracting an equivalent queuing system for each channel in NoC routers a) a typical single channelwormhole router architecture [Dally and Towles 2003] b) each input port channel can be treated as a queuingsystem by weighted averaging the service time of flows routing towards different output directions [Ograset al. 2010; Hu and Kleinrock 1997; Kiasari et al. 2013b]

responding notations are summarized in Table II. Of note, these notations are similarto those used in previous works such as [Hu and Kleinrock 1997; Ben-Itzhak et al.2011; Qian et al. 2015].

3.1. Average-case performance evaluationFor most multi-core systems which do not have a deadline requirement (e.g., a general-purpose computing platform that provides best-effort services), the average-case met-ric is used for design space exploration [Ogras et al. 2010]. To evaluate the mean la-tency, the queuing theory-based models [Lysne 1998; Ogras et al. 2010; Hu and Klein-rock 1997; Fischer and Fettweis 2013] are most widely used. In the following, we sum-marize the basic ideas of queuing models.

3.1.1. The taxonomy of the queuing models. In Fig. 7, we first illustrate how to extract aqueuing system from an NoC router during the queueing analysis. Specifically, each in-put buffer channel (e.g., the north input buffer highlighted in Fig. 7-a) is abstracted asa queuing system. The customers of this queueing system are packets or flits being pro-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 15: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:15

cessed in the router. Because every customer at the current link may be routed to dif-ferent output directions, the service process is usually approximated by the weightedaverage of the service times towards each output direction, where the weights arethe amount of traffic (shown in Fig. 7-b) [Ogras et al. 2010; Kiasari et al. 2013b]. Forsuch queueing system, the Kendall’s representation [Kiasari et al. 2013a; G. Bolch andTrivedi 2006] uses the "A/B/m/K" abbreviation to characterize the system as follows:

— "A" represents the arrival process: For example, the abbreviation "M" of "A" used inthe queuing model stands for the Markovian arrival (i.e., Poisson arrival) process."Er" stands for Erlange arrival while "G" corresponds to a general independent IATdistribution. Of note, some bursty arrival processes reviewed in previous sectionshave also been considered in NoC queuing models. Examples are the MMPP arrivalprocess in [Min and Ould-Khaoua 2004] and the "GE" (generalized exponential dis-tribution) arrival process in [Qian et al. 2014; Wu et al. 2010].

— "B" denotes the system service process. Similar to the arrival process, "B" can bemarkovian with the abbreviation "M", deterministic with the abbreviation "D" or amore general process with the abbreviation "G".

— "m" is the total number of available servers in the queuing system. For wormholeNoCs with a single channel per port 4, since there is only one physical channel in thedownstream router to serve the packets, the queuing system has only one server andm = 1 [Hu and Kleinrock 1997]. On the other hand, for NoCs with multiple virtualchannels, the packets actually contend for one of the available VCs in the routing.Therefore, the effective number of servers will be larger than one depending on theavailable VCs [Ben-Itzhak et al. 2011].

— "K" represents the number of customers that can be held in the queuing system.Existing NoC analytical models define the customer in the queuing system as eithera single packet or a flit. Consequently, "K" is calculated from the router buffer sizebased on the customer granularity.

In queuing models, when computing the waiting time of a queue, the analysis proce-dure usually computes the state probability Pi first, which represents the probability ofhaving i customers (packets or flits) in the queuing system [Kiasari et al. 2013a]. Then,the average number of customers N in the system with capacity K can be calculatedas: [Kleinrock 1975; Donald and Harris 2008]: N =

∑Ki=0 i × Pi. Finally, according to

the Little’s Law [Kleinrock 1975], the average waiting time in the queue Ws equals to[Kleinrock 1975; Donald and Harris 2008]: Ws = N

λ , where λ is the average customerarrival rate at the channel.

In this survey, we summarize several representative NoC analytical models from thefollowing aspects:

— The arrival process: Most NoC analytical models consider the router models underthe assumption of Poisson arrival process. For example, in [Guz et al. 2007], an M/M/1based channel model is designed to analyze impact of link capacity on flow latency.Similarly, in [Lai et al. 2009; Nikitin and Cortadella 2009; Ben-Itzhak et al. 2011], theIAT distribution of the header flits in each flow is assumed to be Poisson. Recently,much research efforts have been spent on generalizing the arrival process model. In[Wu et al. 2010], the traffic burstiness as well as the short-range dependencies (SRD)are characterized using a generalized exponential (GE) distribution [Wu et al. 2010].In [Kiasari et al. 2013b], instead of using a GE arrival model, a 2-state MMPP model

4In this section, we use the notation "Single Channel" to represent the wormhole NoC routers with a singlechannel at each port, while the noation "Multiple VC" represents the NoC routers with multiple virtualchannels.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 16: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:16 Z. Qian et al.

Table III. Comparison of NoC analytical models [Qian et al. 2015; Qian et al. 2014]

Queuing theory based analytical models[Guz et al.2007]

[Ogras et al.2010; Laiet al. 2009]

[Kiasari et al.2013b]

[Ben-Itzhaket al. 2011]

[Wu et al.2010]

[Min andOuld-Khaoua2004]

Queuemodel

M/M/1 M/G/1/K G/G/1/∞ M/M/m/K Ge/G/1/∞ MMPP/G/1/∞

Queuing model summaryArrivalprocess

Poisson Poisson 2-stateMMPP

Poisson Generalexponential

MMPP

Serviceprocess

Memoryless Generalindependent

Generalindependent

Memoryless Generalindependent

Generalindependent

Trafficpattern

Arbitrary Arbitrary Arbitrary Arbitrary Uniform Uniform

Supported router architectureRouter ar-chitecture

SingleChannel

SingleChannel/Multiple VC

SingleChannel

SingleChannel/Multiple VC

SingleChannel

SingleChannel/Multiple VC

Buffersize

Negligible K packets(K > 1)

B flits(B > 0)

Negligible Negligible Negligible

Arbitrationscheme

Round-robin Round-robin Fixedpriority

Round-robin Round-robin Round-robin

is used to capture the bursty arrivals at the injection sources. Specifically, in thatwork, a second moment term, namely the "squared coefficient of variance" (SCV)is computed to characterize the bursty traffic. In [Min and Ould-Khaoua 2004], amultiple-state MMPP traffic model is employed to better represent the self-similarityin the application traffic. Then the MMPP/G/1 queuing model is used to analyze thelatency performance for a specific topology (hyper-cubes) in supercomputers.

— The service process: Under the assumption that packet sizes follow the negative ex-ponential distribution, the service time of the packets can also be approximated asexponentially distributed [Kleinrock 1975]. Based on this assumption, several worksuse the memoryless service time models to predict the flow delays under differentlink capacities or buffer sizes [Guz et al. 2007; Hu and Marculescu 2004a]. On theother hand, there are also many multi-core applications which use constant packetsize [Nikitin and Cortadella 2009] or a more general packet length distribution [Ki-asari et al. 2013b]. In order to work for a constant packet length distribution, in[Nikitin and Cortadella 2009], a modified M/D/1 queuing model is proposed to ad-dress the non-memoryless service time distribution. In order to model a general in-dependent service time distribution, in [Ogras et al. 2010] and [Kiasari et al. 2013b],the M/G/1 and G/G/1 queue-based models are used respectively. More specifically, thesecond moments of the service times are calculated in these models before applyingthe M/G/1 and G/G/1 queuing formula.

— The number of servers: Most work assumes a single server at each input port. Inmultiple-VC router architectures, since the upstream packets can request the usageof any one of the available VCs at the downstream node. Therefore, instead of usinga single server queuing system, [Ben-Itzhak et al. 2011] proposed to use a multiple-server model.

— The system capacity calculation: Many existing analytical models assume a singleflit buffer [Ben-Itzhak et al. 2011] or the buffer size is negligible [Fischer and Fet-tweis 2013; Guz et al. 2007]. Under these assumptions, a packet actually occupies thewhole routing path during its transfer [Ben-Itzhak et al. 2011]. Therefore, for eachintermediate link channel in the routing path, the contending packets at the linkchannel are actually accumulated at the injection sources. Since the source queue isusually assumed to have infinite capacity [Ben-Itzhak et al. 2011], the queueing for-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 17: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:17

mula with infinite capacity (e.g., M/M/1/∞ or M/G/1/∞ queue) are used to analyze thelink channels [Ben-Itzhak et al. 2011]. On the other hand, there are also analyticalmodels such as [Lai et al. 2009; Ogras et al. 2010] which are developed assuming asingle channel buffer can hold N packets at one time. N is an integer larger than one.Then, the queueing system capacity is modelled as N . The derivation of the queuingcapacity becomes more complicated if the NoC router has a finite size buffer (e.g.,several flits) and can only hold a portion of packet in its buffer. In [Hu and Kleinrock1997; Kouvatsos et al. 2005], the authors develop techniques to calculate the queuingcapacity under this situation. Specifically, in their approaches, for any link channel,its capacity is the summation of two parts. The first part represents the number ofcontention flows sending towards this link; the second part computes the averagepackets that are stored at the input port, which is further calculated from the packetlength distribution.

In Table III, we summarize several representative queuing-theory-based models inNoC. As discussed above, each model relies on its specific assumptions and is mostaccurate when the assumptions capture the target application and router architecturecharacteristics.

Based on the models proposed in [Qian et al. 2015; Ogras et al. 2010; Hu and Klein-rock 1997; Kiasari et al. 2013b], we summarize a typical procedure of applying queu-ing theory to predict NoC average latency. For the sake of simplicity, as in [Nikitin andCortadella 2009; Qian et al. 2014], we assume each packet has a fixed size of L flits.Interested readers can refer to [Hu and Kleinrock 1997; Kouvatsos et al. 2005; Arjo-mand and Sarbazi-Azad 2010] for additional modifications that are needed to extendthe model from fixed-length assumption to a general packet size distribution.

3.1.2. Summary of latency calculation for an NoC flow. In typical NoC analytical models, thedelay Ls,d (shown in Fig. 8-a and -b) is used to represent the overall packet transfertime from source s to destination d. It is broken down into three parts [Ben-Itzhaket al. 2011; Qian et al. 2014]: 1) the waiting time at the injection processor s (i.e., vs),2) the packet transfer time (i.e., ηs,d) after being allocated the channel and 3) the pathcontention delay (i.e., hs,d). More specifically, it is expressed in these works as:

Ls,d = vs + ηs,d + hs,d (7)To compute the path contention delay hs,d, it is required to aggregate the delays of

each specific channel along the path of f (Fig 8-b illustrates the channels that shouldbe considered for the flow f6,2). Therefore, in [Ben-Itzhak et al. 2011], hs,d =

∑df

i=1 hlfi,

where lfi is the i− th link of flow f and df is the path length.Similarly, the packet transfer time ηs,d can be calculated as [Ben-Itzhak et al. 2011;

Ogras et al. 2010]: ηs,d =∑df

i=1 ηlfi+ (L − 1), where the transfer time of the header

flit is added across every link, then the second term represents the absorption of theremaining (L− 1) flits at the sink PE.

In summary, to derive Ls,d, the packet competition times h and the packet transfertimes η at every link channel should be computed. The queuing models should addresstwo issues: i) First, we need to analyze the dependency among the channels and decidethe orders of links for applying queueing formula [Hu and Kleinrock 1997; Qian et al.2014]. ii) Second, it is important to extract equivalent queuing systems for each linkchannel and clearly identify the arrival and service processes of the queueing system.Particularly, when modeling the arrival and service processes, we need to differentiatebetween the Single Channel and Multiple VC routers, which are shown in Fig. 8-c and-d, respectively. In section 3.1.3, we summarize the analytical procedures for Single

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 18: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:18 Z. Qian et al.

0 1 2

3 4 5

6 7 8

6

0

7

32

f6,2

v6

Local

(a) (b) 5

l6,7 l7,8

l5,2

l8,5

(c)

h

η

current flow

contention flow

North

East

Local (d)

NorthVC0

R

VCi

East

VC0

VCi

Multiple flowsR R R

R R R

R R R

Fig. 8. An illustration of the queueing delay calculation [Qian et al. 2014]: a) an application representedas an architecture characterization graph (ARCG), b) the link channels which form the routing path of flowf6,2, c) the illustration of path competition time h, the flit transfer delay η and d) flow sharing effects inMultiple VC router architectures [Ben-Itzhak et al. 2011].

Total buffer slots of L flits (a packet size)

Point APoint B

η

Point C

contention flow

Point D

North

East

Local

current flow

contention delay h

R1 R2 R3 R4

Fig. 9. An illustration of the channel service scenario for a packet in the east link channel of a SingleChannel wormhole router architecture (c.f. [Hu and Kleinrock 1997; Kouvatsos et al. 2005; Qian et al. 2014])

Channel routers. In section 3.1.4, we discuss the extensions that are required to modelMultiple VCs.

To resolve the first issue of channel dependencies, it is realized the enforced depen-dency is due to the flow control between NoC routers. For example, as shown in Fig. 8,both links l6,7 and l7,8 are part of the routing path of flow f6,2. The waiting times of l7,8affect the packets transmission at channel l6,7. There are several methods to addressthis link dependency issue. For example, in [Kiasari et al. 2013b], a channel indexingmethod is proposed based on the distances to the destinations. In [Lai et al. 2009], achannel sorting method is proposed based on the routing algorithm used. A more com-monly used method is to build the link dependency graph (LDG) of the application. Aproper link order ensures when analyzing a vertex (i.e., a link channel) in LDG, all itsprecedents have been computed. For more details, the readers are suggested to refer to[Hu and Kleinrock 1997; Foroutan et al. 2013; Arjomand and Sarbazi-Azad 2010; Qianet al. 2014], where the procedures of building LDG and deriving the link orders fromLDG are provided.

3.1.3. Single Channel Wormhole Router models . In the following, we summarize a typicalprocedure to calculate the competition delay h, the transfer delay η and the sourcequeue waiting delay vs for Single Channel wormhole router architectures.

1) Flit transmission delay queueing model: In [Ben-Itzhak et al. 2011; Qian et al.2015], the flit transfer time η over link l corresponds to the time cost to send the flitfrom the input port buffer to the head of the buffer at the next router, assuming the flithas been allocated to use that link. To be more specific, in Fig. 9, the flit transfer time

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 19: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:19

over the link connecting router R1 and R2 is depicted as the time duration from PointA to D. In [Qian et al. 2014], it was further broken it into two parts for calculation. Thefirst part consists of the time needed to traverse the current router. In a pipelined NoCrouter, this time can be approximated as the depth of router pipelines. After leavingthe router, the second component in η calculates the transfer delay of a flit to move tothe next channel (i.e., moving from Point C to Point D in Fig. 9).

In order to calculate the transfer time from Point C to D, [Qian et al. 2015] proposeda queuing model whose capacity equals to B+1. Specifically, B is the router buffer sizedefined in Table II. The additional one in the capacity B+1 considers the flit location atthe upstream buffer where the request to the current link is generated and granted. Inorder to apply the queueing formula, the arrival and service process of this equivalentqueueing system are identified. The mean customer (i.e., flit) arrival rate is computedby aggregating that of every flow in set Fl [Ogras et al. 2010]: λflit

l = L ×∑

f∈Flλf ,

where λf is the mean rate of flow f , L is the packet size. The mean service time isapproximated as the weighted average of different flows f passing through the link l[Hu and Kleinrock 1997]. Specifically, it is calculated as [Qian et al. 2015]:

sflitl =

∑∀f∈Fl

[λf × (hd(f,l,1)

L + 11−Pbd(f,l,1)

)]∑∀f∈Fl

λf(8)

where d(f, l, 1) represents the downstream neighboring channel of link l regardingto flow f . To understand Eqn. 8, we can consider the service process of a specific packetduring the routing. This service process always starts with the downstream channelarbitration from the header flit in packet. In Eqn.8, it assumes the header flit takeshd(f,l,1) cycles to be allocated the downstream link. Then, for the body and tail flits,due to the Single Channel router architecture, they can use the link channel for trans-mission exclusively. Therefore, they are sent out smoothly without any further delay.The only exceptional case which interrupts the smooth transmission is due to the flowcontrol. Specifically, if the downstream channel indicates its buffer is full, the body/tailflits have to be stalled. Let Pbd(f,l,1) represent the probability of having a full buffer atlink d(f, l, 1). Then, similar to derivation in [Lai et al. 2009], the mean time to route aflit under Pbd(f,l,1) can be calculated as 1

1−Pbd(f,l,1). Hence, the overall service time for

a flit belonging to flow f is the average of different flits in the same packet and can becomputed as hd(f,l,1)

L + 11−Pbd(f,l,1)

in Eqn. 8.After modelling the queuing system in the flit transfer process, the corresponding

queuing formula such as M/M/1/K in [Ben-Itzhak et al. 2011] or M/G/1/K in [Hu andKleinrock 1997; Lai et al. 2009] can be used to evaluate the system waiting time. Thederived waiting time then provides an estimation of flit transmission time η.

2) Path competition delay queueing model: The arbitration or flow competition delayh is defined in [Ben-Itzhak et al. 2011] as the time for a packet header to be successfullygranted its next channel over other competitors. In general, we can also abstract aqueuing system for this arbitration process [Ben-Itzhak et al. 2011; Hu and Kleinrock1997]. Various analytical models have been proposed. Examples are the G/G/1 [Kiasariet al. 2013b], M/G/1/K [Lai et al. 2009; Arjomand and Sarbazi-Azad 2009], GE/G/1[Wu et al. 2010] or MMPP/G/1 [Min and Ould-Khaoua 2004] queuing models. In thesequeuing models, the capacity K of the queuing system is usually modelled by the totalnumber of flows contending for the same output direction [Ben-Itzhak et al. 2011; Huand Kleinrock 1997].

Similar to the procedure of deriving η, the mean customer arrival rate of the queueextracted for calculating h is obtained by accumulating the traffic rate of all flows pass-ing the channel. In addition to the first moment, other higher moments of the arrival

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 20: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:20 Z. Qian et al.

process can also be computed depending on the specific traffic model employed in theanalysis. For example, if the G/G/1 [Kiasari et al. 2013b] or GE/G/1 [Qian et al. 2014]queuing model is used to derive the contention delay h, the second moments (i.e., SCVs)of the packet arriving process over the channel l is also required. The derivations ofthe input traffic SCV can be calculated following the approximations in [Kiasari et al.2013b; Qian et al. 2015].

To represent the service process of the queuing model, a widely used service timemodel is illustrated in Fig. 9 (c.f. [Hu and Kleinrock 1997; Qian et al. 2014]). In Fig.9, assuming we need to calculate the contention delay of a packet in the east inputchannel of router R1, the service of that packet begins right after its header flit in PointA being granted the next channel and finishes at the instance that the tail flit leavesthe buffer. This time duration is denoted as the service time because after that, otherflows requesting the usage of l can arbitrate again for the physical link. In this way,the waiting time of the extracted queueing system has the meaning of the contentiondelay to be allocated a channel. For a network under light/medium load, this serviceduration just equals to the packet size because each flit can pass through the channeldirectly; on the other extreme, under heavy traffic load (e.g., the downstream channelsare under severe congestion), this time is calculated as the duration of the header flitto move to the location (i.e., Point B in Fig. 9) where the entire packet can be storedacross that location and the current buffer slot (i.e., Point A) [Hu and Kleinrock 1997;Kouvatsos et al. 2005; Arjomand and Sarbazi-Azad 2010]. Let xf

l represent this worst-case service time, then the mean queue service time sfl with respect to flow f is a valuedictated by the length of packet L (minimum value) and xf

l (maximum value) [Hu andKleinrock 1997]. In practice, xf

l is calculated first by accumulating the waiting timesalong the channels from Point A to B; then, sfl is approximated based on L and xf

l as in[Hu and Kleinrock 1997; Qian et al. 2015]. After computing every sfl , the mean servicetime for packets traversing l can be computed by averaging over sfl [Hu and Kleinrock1997], i.e.,: sl =

∑∀f∈Fl

(λf × sfl )/∑

∀f∈Flλf .

Besides the mean value of the service time, for queuing models with non-memorylessservice time distribution, such as the M/G/1, G/G/1 and MMPP/G/1 models, the secondmoment term, i.e., SCV of the service process is also required. One way to approximateSCV is given in [Kiasari et al. 2013b; Qian et al. 2015]:

R2l =

(sfl )2

(sfl )2− 1 = (

∑∀f∈Fl

λf × (sfl )2∑

∀f∈Flλf

)/(sl)2 − 1 (9)

Based on above discussions, after mathematically building the models for the arrivaland service process in the queuing system, the M/G/1/K [Ben-Itzhak et al. 2011], G/G/1[Kiasari et al. 2013b] or GE/G/1 [Qian et al. 2015] queuing formula can then be appliedto evaluate the waiting times of the queueing system, i.e., the flow contention delay hfor the target link channel.

3) Waiting times at the source node: Typically, the traffic injection queues at the NoCnetwork interfaces (NIs) is modelled as a system whose capacity is infinite; this isbecause the inner memory bandwidth of the PEs are usually much higher than thatof the router channels, which therefore supports accumulating much more packetsat the sources before injecting them into the network [Dally and Towles 2003]. Tomodel the source waiting time, different queuing models can be applied. For example,[Ben-Itzhak et al. 2011] proposes to use an M/M/1/∞ queue and [Wu et al. 2010; Qianet al. 2014] uses the GE/G/1/∞ queuing model. Of note, for the source queues, the

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 21: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:21

traffic arrival and service processes are calculated in a way similar to that discussedin characterizing queuing systems for contention delay h.

3.1.4. Extensions to router architectures with multiple VCs . For Multiple VC router archi-tectures, the physical channel bandwidth is time-division multiplexed (TDM) amongseveral flows at the same port [Ben-Itzhak et al. 2011]. To model the VC router archi-tectures, two methods are widely used. The first way is to employ a multiple serversqueue to replace the single-server model in Single Channel routers. Specifically, in[Ben-Itzhak et al. 2011], when calculating the path contention delay h, an M/M/m/Kqueuing is used. The number of servers m in the model equals to the number of effec-tive VCs at the downstream channel. Moreover, the authors also propose to considerthe facts that when there are multiple flows being granted downstream VCs of thesame port, the flits actually share the same physical bandwidth together. Therefore,the flits appeared on the same link can come from several flows which is different fromthe case in Single Channel wormhole routers. To address this, the derivation of theflit transfer time η should also be modified. Towards this end, the concept of "effec-tive channel bandwidth" of a particular flow is introduced to represent the equivalentbandwidth seen by a VC during the switch transmission. With these two modifications,the path contention delay h and flit transfer time η should be able to incorporate theeffects of VC sharing. They can then be used to predict the mean flow latency as in theSingle Channel models.

The second way to model Multiple VC router architectures is to refine the obtainedmean packet waiting time from Single Channel model by a factor V̄ [Ould-Khaoua1999; Dally 1992]. The parameter V̄ represents the mean "degree of VC multiplexing"at a specific link channel [Ould-Khaoua 1999]. With the scaling factor V̄ , the meanflow delay from source tile s to destination tile d is rewritten as [Ould-Khaoua 1999]:Ls,d = (vs + ηs,d + hs,d) × V̄ . To calculate V̄ , a typical way is to employ the VC statetransition diagram (STD) [Kiasari et al. 2008; Wu et al. 2010]. In STD, each vertexVi represents a state that there are i VCs at the input port already being allocated tosome packets. Specifically, the weight on the edge of STD pointing from state Vi to Vi+1

denotes the probability that a new VC is allocated to other packets under the existenceof i busy VCs. Similarly, the weight on the edge pointing from Vi to Vi−1 reflects theprocess of finishing serving a packet and releasing one VC for further routing [Kiasariet al. 2008; Wu et al. 2010].

The transition rate between Vi and Vi+1 depends on the current channel state (i.e.,number of VCs being used) as well as the characterized arrival/service processes. In[Wu et al. 2010] and [Min and Ould-Khaoua 2004], the transition probability from Vi

to Vi+1 is derived according to an optimization algorithm that maximizes the overallentropy for the GE and MMPP type traffics; the transition rate from Vi to Vi−1 isdictated by the packet service rate, which is 1/s (s is the average time for servingflits at the current channel). After calculating the transition rates on STD, the stateprobability, Pv, (0 ≤ v ≤ V ), which represents the probability that there are v VCsbeing allocated, can be computed by solving the extracted Markov chain as in [Minand Ould-Khaoua 2004; Wu et al. 2010; Ould-Khaoua 1999]. Finally, the degree of VCmultiplexing V̄ is given by [Dally 1992]: V =

∑Vv=0 v2Pv∑Vv=0 vPv

3.2. Worst-case latency prediction for NoCFor many real-time systems such as the healthcare and embedded control related ap-plications, it is more important to guarantee the packets are received and the subse-quent actions are taken in time. For NoC systems, since most of the resources (e.g.,links and buffer channels) are shared by flows, the contentions introduce a large delay

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 22: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:22 Z. Qian et al.

variation, especially under heavy workload. Therefore, for these systems, one chal-lenge is to predict the worst-case delay accurately to avoid over-design and ensure aconfiguration can meet the hard/soft deadline requirement. In [Kiasari et al. 2013a],the authors have surveyed three worst-case performance analysis approaches, that arenetwork calculus based method, the dataflow analysis framework and the schedulabil-ity analysis. In this section, we first present another approach used in NoC community,i.e., real-time bound (RTB) analysis [Rahmati et al. 2013; Lee 2003; Rahmati et al.2009]. For the sake of completeness, we also highlight the concepts and ideas of theother three approaches.

3.2.1. Real-time bound analysis for maximum delay prediction. The rationale behind real-time bound (RTB) analysis is to predict the target flow’s latency under a scenario thatprovides least support for current packet transmission [Rahmati et al. 2009]. Specif-ically, this worst-case transmission of a flow f occurs when all the router channelslocating in f ′s routing path are full and flow f has the least priority to be allocated therouting resources when competing with other flows. Under this assumption, the pack-ets of f have to wait until all other flows have been served. Therefore, in [Rahmatiet al. 2009; 2013], the upper bound latency D̄f of flow f is re-written as:

D̄f = ts1 + ts2 +

df∑j=0

ulfilfi ∈ Pf (10)

where ts1 (ts2) represents the delay of issuing (collecting) a packet with multiple flitsinto (out of) the NoC and ulfi

is the maximum delay required for the packets in flow f to

leave the current channel lfi . Of note, to be consistent with the previous sections, alsofollowing the conventions in [Hu and Kleinrock 1997], here we use lfi to represent thei − th hop channel in the routing path of flow f . lfi+1 represents the next link channelin flow f ’s routing path.ulfi

is made up of two parts [Rahmati et al. 2009; 2013]. First, it includes the residual(remaining) time to route the packets which are already being served at the channelbuffer. These packets belong to either flow f or its competitor f∗. Due to backpressureor flow-control, this residual time corresponds to the maximum waiting times of thedownstream links: max(ulfi+1

, ulf∗i+1

), ∀f∗ ∈ Ilfi(f), where Ilfi

(f) denotes the set of flows

that contends with f at the current link lfi . lfi+1 and lf∗i+1 represent the downstreamchannel of flow f and f∗, respectively.

The second part in ulficonsiders the additional delay due to the loss of arbitration

under the least priority assumption. More specifically, after the remaining packetsleave the channel, all the flows compete to use that channel again. For the worst-casetransfer situation, the packet of current flow f has to wait for packets of all other flowsfinishing transmission once (i.e.,

∑f∗ ulf

∗i+1

) since it always loses the arbitration.Finally, the overall delay of ulfi

is given by adding these two parts together [Rahmatiet al. 2009]:

ulfi= max(ulfi+1

, ulf∗i+1

) +∑f∗

ulf

∗i+1

∀f∗ ∈ Ilfi(f) (11)

At the destination nodes, since they consume the packets immediately; both ulfi+1and

ulf∗i+1

in Eqn. 11 equal to the packet length. Therefore, in practice, ulfiis calculated from

the flow destinations backwards to the sources [Rahmati et al. 2009]. After obtainingulfi

, the worst-case delay bound D̄f can be calculated accordingly.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 23: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:23

3.2.2. Network calculus based worst-case delay prediction. Following the discussions in[Boudec and Thiran 2004; Kiasari et al. 2013a; Chang 2000], we can summarize the ra-tionale of network calculus as follows: for a queuing system, let I(t) and O(t) representits input and output characteristic function, respectively. More precisely, I(t) denotesthe total number of packets received at a specific router port until time t. O(t) rep-resents the number of packets sending out of that port until the same time instance.Under these notations, b(t) = I(t)− O(t) actually represents the number of customersthat have to wait in the buffer at time t; while d(t) = infτ≥0{I(t) ≤ O(t+ τ)} describesthe time duration that a customer has to stay in the system before it is routed to thenext node under the assumption this customer arrives at time t [Qian et al. 2009a;Kiasari et al. 2013a].

For the worst-case delay derivation, the network calculus approaches use two curvesto bound the input and output characteristic functions, namely the arrival curve α(t)and the service curve β(t) [Kiasari et al. 2013a]. α(t) and β(t) bound the arrival func-tion I(t) and output function O(t) as follows [Chang 2000; Qian et al. 2009a]:

I(t)− I(s) ≤ α× (t− s), ∀s ≤ t (12)

O(t) ≥ I(s) + β × (t− s), ∀s ≤ t (13)Giving the characterized α(t) and β(t) curves, the maximum delay of forwarding acustomer is obtained by computing the maximum horizontal distance between β(t)and α(t) curves over all the time instances t [Kiasari et al. 2013a; Chang 2000].

To calculate the maximum routing delay of a specific flow f in NoC, all the routersthrough which flow f passes are considered first. Then, an equivalent service curve βf

of this flow is derived by convolving each router’s service process [Qian et al. 2009a].To resolve different contention scenarios that a flow may encounter, in [Qian et al.2010b], the authors propose a systematic way to build a contention tree. The flowscompeting with the current flow f at each router are recognized. Then, the contentionscenarios are classified into three typical cases. The equivalent service curve is derivedrespectively for these three cases. The model is further optimized in [Qian et al. 2009b]by considering the feedback between the neighboring routers using credits for flowcontrol.

Recently, the network-calculus based models have evolved in the following direc-tions:

— Analyzing self-similar traffic: In [Qian et al. 2009c], the network calculus based tech-niques are applied to estimate the delay under self-similar traffic. Specifically, in-stead of using a deterministic arrival curve, the self-similar arrival curve is derivedby including a few additional parameters (namely the excess probability in that work)to capture the LRD in the traffic. By using the new arrival curve model, the authorsdemonstrate a more practical delay bound can be obtained for fractal applications.

— Modeling for more complicated router architectures: in [Qian et al. 2010a], the net-work calculus approach is applied for the VC router architectures by further consider-ing the service process of VC allocation. In [Zhao and Lu 2013], the NoC elements aremodelled using a formal description approach called eXecutable Micro-ArchiecturalSpecification (xMAS) [Chatterjee et al. 2012]. Then the xMAS NoC models are an-alyzed using network calculus techniques to compute the delay bound of each flow.Compared to the previous models, the xMAS model takes the advantage of the xMASframework to integrate more control details in the NoC routers.

— Applying real-time calculus techniques: Real time calculus [Thiele et al. 2000] isproposed to extend the network calculus idea on the real time systems. Comparedto the conventional network calculus approaches, the upper and lower bound curveswhich are the functions of the time interval, are used to characterize the arrival

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 24: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:24 Z. Qian et al.

and service processes, respectively [Kiasari et al. 2013a]. For instance, the upperbound arrival curve with a parameter δ indicates the maximum number of arrivingcustomers during the time interval δ. Based on the upper and lower bound curves,[Chakraborty et al. 2003] develops a framework to analyze the maximum delay forthe real-time embedded systems.

— Predicting statistical delay distribution: Recently, the stochastic network calculus[Jiang and Liu 2008] technique has been applied in NoC to predict the statisticaldelay bound for NoC systems [Lu et al. 2014]. Specifically, the original deterministicarrival and service curves have been extended with an additional function to describethe probability of bounding the actual arrivals and services [Kiasari et al. 2013a]. Inthis way, the delay distribution of a target flow can be predicted using the statisticalarrival and service curves; the result is more useful for systems with soft-deadlinerequirements.

3.2.3. Schedulability and data flow analysis techniques. Besides above two approaches, theschedulability analysis and data flow analysis techniques have also been applied inNoC-based systems (e.g., [Shi and Burns 2008; Bekooij et al. 2004]). In general, thesetwo approaches are more often used in evaluating whether a scheduling/priority as-signment policy can satisfy the deadline requirement of real-time systems. For theschedulability analysis in [Shi and Burns 2008], the authors identified two types ofinterferences of a target flow f . Specifically, the direct interfering flows share the linkswith f and cause resource contentions. The indirect interferences only implicitly af-fect f via influencing the issuing time of packets in f ’s direct interferences. Based onthe priority assignment and the preemption assumption, a worst case delay can be de-rived accordingly. Such analysis framework is used to explore an optimal (i.e., lowesthardware cost) task mapping and priority assignment for NoC-systems with deadlineconstraints [Shi and Burns 2010]. For the data flow analysis, the data flow graph ofthe system is built first. Then, it can be used to evaluate whether a given buffer sizingor scheduling approach meet all the flow deadline requirements [Bekooij et al. 2004;Hansson et al. 2008].

4. SIMULATION BASED EVALUATION METHODSBesides mathematical models, NoC simulations are also important in performanceevaluations due to their higher accuracy. In this section, we review the development ofsimulation-based evaluation methods in NoC.

4.1. NoC simulators designWhen using an NoC simulator to evaluate a design, the user needs to choose one witha proper abstraction level. Generally, the most accurate simulator implements everycomponent (e.g., crossbar switch, switch allocators etc.) in the router exactly follow-ing the guidelines to be laid out. Netmaker [Netmaker 2009] is such a simulationplatform which is implemented in SystemVerilog and can be synthesized in ASIC orFPGA. It supports a variety of router and network configurations (e.g., routers andlinks with variable numbers of pipeline stages). It also supports some advanced routerstructures such as the speculative router [Peh and Dally 2001]. Another example isCONNECT [Connect 2011], which is a flexible Bluespec SystemVerilog based NoC em-ulator supporting different configurations in architectures (e.g., input queued, virtualoutput queued or virtual channel based routers). Moreover, the CONNECT users canalso configure the network into different topologies such as Mesh, Torus, Fat Tree andButterfly. In general, the hardware description language (HDL) based platforms aremost suitable for the final stage system prototyping. When such simulators are usedin the design space exploration, it takes a significant time to re-synthesize the designs

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 25: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:25

Table IV. Summary of current NoC simulators

Topology Router Backpressure TrafficNetmaker Mesh VC; SPECa Credit Synthetic

CONNECT Mesh/Torus/Fat Tree VC; IQ; VOQ b Credit/Peek Flow Control SyntheticBooksim Mesh/Torus etc. VC Credit Synthetic/Task-graph/TracesNoxim Mesh WHc ABPd Sythetic/Task-graph

WormSim Mesh/Torus WH Buffer availability Synthetic/Task-graph/TracesHNoC Regular/Irregular HVCe Credit Synthetic/Task-graph/TracesNNSE Mesh/Torus VC Stress Values Uniform/Locality/Task-graph

gpNoCsim Mesh/Torus/Fat Tree VC - SyntheticTOPAZ Mesh/Torus/Midimew VC, VCTf - Synthetic/TracesGarnet Regular/Irregular VC, EVC Credit/Buffer availability Synthetic

gMemNoCsim Mesh/Ring/Custom VC Credit/Stop and go TracesXmulator Regular/Semi-regular VC - SyntheticDARSIM Regular/Irregular VC - Synthetic/Traces

NoCTweak Mesh/Torus/Ring VC; BFLg Credit Synthetic/TracesXpipes Application-specific VC ACK-NACK/Stall and goh Task-graph

OCIN_TSIM Mesh/Star VC Credit Synthetic/TracesHORNET Regular/Irregular VC/BDLi - Synthetic/Traces

aSPEC: the Speculative router architecture [Peh and Dally 2001]bIQ: Input queued router; VOQ: Virtual output queued routercWH: the wormhole router without using virtual channelsdABP: Alternating Bit Protocol for flow control [Noxim 2011]eHVC: Heterogeneous Virtual channel architecture. The VCs of each port may be different.fVCT: Virtual Cut Through router architecturegBFL: Bufferless router architecturehXpipes Lite [Stergiou et al. 2005] supports the stall and go protocol [Flich and Bertozzi 2010].iBDL: NoC architecture with Bi-directional Links

and re-run the simulations especially when the data- or control-path components stillneed to be largely explored.

To speedup the evaluation, the other type of NoC simulators abstracts the routercomponents in a higher level language (e.g., SystemC or C++) and provides cycle- andflit-accurate evaluations of the target design. Compared to hardware description lan-guages, more flexible data types and structures can be used in these models. Therefore,many router components such as buffers and allocators can be implemented more ef-ficiently. The simulations are then performed either cycle-by-cycle (i.e., cycle-driven)or use an event queue to maintain the next operation (i.e., event-driven) [Dally andTowles 2003]. Another feature of these simulators is the parametrized design, whichallows the users to explore over a large design space such as the buffer sizes, linkbandwidth, routing schemes without re-compiling the simulators.

In Table IV and V, several widely used NoC simulation tools in the research com-munity are summarized and compared. They are Netmaker [Netmaker 2009], CON-NECT [Papamichael and Hoe 2012], Booksim [Jiang et al. 2013], Noxim [Noxim 2011],Wormsim [Wormsim 2008], HNoC [Ben-Itzhak et al. 2012; HNoC 2013], NNSE [Luet al. 2005], gpNoCsim [Hossain et al. 2007], TOPAZ [Abad et al. 2012], Garnet [Agar-wal et al. 2009], gMemNoCsim [gMemNoCsim 2011; Lodde and Flich 2012], Xmulator[Nayebi et al. 2007], DARSIM [Lis et al. 2010], NoCTweak [Tran and Baas 2012],Xpipes [Bertozzi and Benini 2004; Stergiou et al. 2005], OCIN_TSIM [Prabhu 2010]and HORNET [Lis et al. 2011] 5. Specifically, in Table IV, we summarized the topolo-gies, the router architectures, the backpressure signal characteristics and the traffic

5Actually,there are more open-source simulators that are available and have been used in NoC evaluations,examples are the SICOSYS [Puente et al. 2002], OCCN [OCCN 2003], Nigram [NIRGAM 2007] and Atlas[Atlas 2011]. They share many similar features as those in Table IV and V. Interested readers may refer to[Ababei et al. 2012], which summarized the information of simulators used by different groups.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 26: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:26 Z. Qian et al.

Table V. Summary of current NoC simulators (Cont.)

Parameters Modeling Language Parallel sim Embedded in FSSa

Netmaker Packet/VCb/Allocatorc etc. SystemVerilog - -CONNECT VC/Allocator etc. Bluespec SystemVerilog - -

Booksim Pipelined/VC/Allocator/Packet etc. C++ -√

Noxim Buffer/Packet etc. SystemC -√

WormSim Buffer/Packet etc. C++ - -HNoC VC/Link etc. OMNeT++

√-

NNSE VC/Packet etc. SystemC - -gpNoCsim Packet/VC etc. Java - -

TOPAZ Pipeline/Packet/VC etc. C++√ √

Garnet Pipeline/VC/Bandwidth etc. C++ -√

gMemNoCsim VC/Packet etc. C/C++ -√

Xmulator VC/Message etc. C# - -DARSIM VC/Packet/Bandwidth etc. C++

√ √

NoCTweak VC/Pipeline/Allocator etc. SystemC - -Xpipes VC/Switchese etc. SystemC/Verilog -

OCIN_TSIM VC/Pipeline/Allocator etc. C++ - -HORNET VC/Packet/Bandwidth etc. C/C++

√ √

aFSS: Full-system simulator. The entries with√

represent the simulator has been embedded in some full-system simulator,such as Simics [Simics 2012] , Gem5 [Gem5 2009] and Graphite [Graphite 2010] for execution-driven simulationbThe virtual channel number, buffer depth can be specified by the usercThe allocator and arbiter type can be configureddThe pipeline depth can be specifiedeThe switch (crossbar) size can be configured

supported by these simulators. As shown in Table IV, most simulators support theexploration of regular or semi-regular topologies (e.g., mesh, torus and hyper-cubes).These topologies are more commonly used in Chip Multiprocessor (CMP) architectures,where the processing elements are homogeneous. On the other hand, there are othersimulation frameworks such as Xpipes and HNoC, which support arbitrary topologiesand heterogeneous configuration of different routers. Therefore, they are suitable toexplore the design of Multiprocessor System-on-Chip (MPSoC) architectures. In TableV, we summarize the features of different simulation engines, including their config-urable parameters, the modeling language, whether they provide the native supportof parallel simulation and can be directly embedded in a full-system simulator. For theconfigurable parameters, most simulators allow the user to specify the VC (e.g.,numberof VCs per port and VC buffer depth) and the packet (e.g., packet size, fixed- or vari-able length distributions) configurations. Some simulators also provide the options forusers to explore the router architectures (e.g., allocator type and pipeline depth). Forthe power evaluation, two methods are widely used. The first is to use bit-level energymodel which estimates the energy consumption of transferring a single bit over theswitch and the link [Hu and Marculescu 2005]. Noxim and Wormsim simulators aretwo examples. The second method is to include the Orion [Kahng et al. 2012] library,which is a detailed NoC power model, in the simulators. For example, Booksim andGarnet simulators have integrated Orion. Wormsim also provides an option to evalu-ate power using Orion model.

4.2. Addressing scalability challenge in NoC simulationsPreviously, for flit-level NoC simulators, there are built on either cycle-based or event-driven models [Dally and Towles 2003]. Cycle-based simulators check and update thenetwork status (e.g., the buffer occupancy, the allocation request/response) every cycle.On the other hand, event-driven simulators update the network status only at the spe-cific instances when an event/transaction was previously scheduled in the event queue.Both these two simulators accurately predict the design performance in the system

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 27: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:27

level. However, with the advancement in IC integration, there may be hundreds orthousands of cores on future NoC systems [Borkar 2009]. An emerging challenge forboth cycle-based and event-driven evaluations is to provide fast and efficient simula-tion of a large scale NoC system. In the following, we summarize several representativeapproaches:

— High level abstraction for simulation: INSEE [Ridruejo Perez and Miguel-Alonso2005] is such an simulation framework which contains a functional simulator namedFSIN. Different from cycle-accurate simulators, FSIN omits some modeling detailssuch as cycle-accurate behaviors of the routers. Therefore, the simulator is simpleand fast for early phase design space exploration. The users can use FSIN to simu-late a much larger network compared to the SICOSYS [Puente et al. 2002] simulator.However, the accuracy is compromised in such functional simulators.

— Reduce simulation points by statistical sampling techniques: To compensate the ac-curacy loss in functional simulation, one way is to combine detailed and functionalsimulations. In general, the statistical sampling techniques help to choose only aportion of the application that needs to be fully simulated [Dai and Jerger 2014]. Forthe other parts, either only functional simulation is conducted or they are simplyskipped. In [Dai and Jerger 2014], the idea of previous sampling based full-systemsimulation (e.g., [Carlson et al. 2013],[Ardestani and Renau 2013] and [Wenischet al. 2006]) is extended to NoC-based systems. Two schemes are proposed. The firstscheme uses sampling theory to explore the size of samples based on the metric (e.g.,latency) variation and confident interval requirements. The second scheme identi-fies different phases in the application by using the clustering/classification method.Then, the sampling is done by choosing samples from each class. It has been demon-strated that both approaches can achieve an order of magnitude speedup with a smallerror (less than 10%).

— Exploring capabilities for parallel simulation: To enable fast and large-scalesimulation, instead of executing the simulator program sequentially on a hostPC/workstation, parallel computing resources (e.g., multithreads or multicore re-sources on a CPU or GPU card) are exploited to accelerate the overall evaluationprocess [Eggenberger and Radetzki 2013]. In [Lis et al. 2011], a highly flexible andmultithreaded-based NoC simulator named HORNET is developed. Specifically, par-allel evaluation is achieved by dividing the whole NoC system into several differentparts; each part is mapped and scheduled onto an individual thread. During thesimulation, different threads are synchronized periodically to exchange the networkstatus of different regions. The synchronizing frequency determines the tradeoff be-tween the simulation speedup and the evaluation accuracy. On one hand, frequentsynchronizations ensure each region to keep pace with the state changes that happenin the neighboring region. On the other hand, frequent synchronizations create bar-riers and significantly affect the overall speedup [Eggenberger and Radetzki 2013].In [Zolghadr et al. 2011], an NoC simulator is developed on Nvidia GPU platformbased on CUDA library. In this simulator, each GPU thread is used for the evalua-tion of a router or a link. Every cycle, the shared or global memory of GPU is usedto synchronize/update the transactions. Therefore, it requires a significant effort onsynchronization among threads; the overall speedup performance under very largenetwork size is limited [Eggenberger and Radetzki 2013]. In [Pinto et al. 2011], theGeneral Purpose GPU (GPGPU) based simulator for large scale NoC system is de-signed. Each GPU thread is used to evaluate a tile (the core and router). The GPUglobal memory is used to store the packets to speedup the data access across differ-ent threads. To simulate a large-scale network, the instruction instead of cycle levelaccuracy is provided. Recently, in [Eggenberger and Radetzki 2013], the authors pro-

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 28: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:28 Z. Qian et al.

pose a simulation method with high scalability. Specifically, the simulation tasks areordered first based on their dependencies. The maximum number of tasks that canbe simulated simultaneously without causing interferences with each other is identi-fied. An efficient synchronization scheme is then used together with a load balancingscheme to dynamically distribute the simulation workload among different threadswithin the host PC. Based on above optimizations, when executing the simulationson a multi-core host computer, a significant speedup can be achieved even for an NoCwith more than one thousand processors.

4.3. NoC benchmarking effortsStandardization efforts have been taken to provide methodologies for NoC benchmark-ing. The Open Core Protocol International Partnership Association (OCP-IP) formed aworking group on NoC benchmarking [Mackintosh 2008] to propose standard bench-marks and methods for evaluating and comparing NoC designs with different topology,traffic and router implementation characteristics [Grecu et al. 2007]. With a similartarget, the NoCBench [NoCbench 2011] project has also released its models which con-tains benchmark traffic and router models for performance evaluation.

5. COMBINING ADVANTAGES OF ANALYTICAL MODELS AND SIMULATIONSThe NoC analytical models discussed in Section 3 provides a fast way for evaluatingNoC performance in the early design stage. On the other hand, the NoC simulatorsclosely model the real system behaviors at run time but require significantly longertime. Therefore, one interesting research direction in NoC performance evaluation isto combine the advantages in these two approaches and explore a new way for estimat-ing NoC performance. In this section, we review several interesting progresses towardsthis end. We first discuss using FPGAs to accelerate the simulations and then intro-duce the machine learning based techniques to improve the analytical model accuracyfrom simulation training samples.

5.1. Hardware based NoC simulatorInstead of implementing the software-based NoC simulators, one recent direction inNoC emulator design is to use real hardware platform for performance evaluation. Ofnote, this is different from previous work [Wolkotte et al. 2007; Genko et al. 2005]which implements the NoC exactly as an ASIC chip or a FPGA embedded systemfor prototyping. The concept of hardware based NoC simulator still tries to explorea higher level abstraction of the router model (for example, using the FPGA to im-plement the mathematical models as in [Papamichael et al. 2011] or the behavioralmodels as in [Wang et al. 2011]). The motivation of implementing such a high levelNoC model on hardware is from two aspects [Wang et al. 2011]: first, the abstractionin a higher level takes the advantage of larger flexibility. Therefore, they are suitableto explore and compare different design choices before the data- and control- path be-ing fixed. Second, compared to executing the simulation as a software program, thehardware based simulator emulates different PEs and routers in parallel every cy-cle. Therefore, the simulator is more scalable for evaluating a large network size andreduces the overall evaluation time.

For example, in [Wang et al. 2011], a flexible and efficient FPGA-based simulationarchitecture was developed. The authors claimed instead of implementing the systemexactly on hardware, it is more flexible to map the major data- and control- path com-ponents as generic library components in FPGA. For different NoC architectures to beevaluated, the users just need to make a collection of corresponding basic elementsfrom their library (e.g., the specific arbiters and buffers). In this way, the FPGA sim-ulation engine can be viewed as a virtualization of the architecture to be explored.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 29: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:29

Booksim 2.0 simulator

Training traffic

Form labels

Waiting times based on queuing model

SV Regression SV Regression

fCQ fSQ

New traffic inputs ARCG(T,P)

Form features

Channel waiting and source queuing

delay

XCQ , XSQ

Training flow Applying SVR-NoC

Link dependency analysis

Link order

NoC queuing model

XCQ XSQ

Flow latency

Ls,d

Simulated waiting times

Application info

Channel waiting time

source waiting time

Fig. 10. The workflow chart of the SVR-NoC model [Qian et al. 2013; 2015]: the overall framework trainsthe collected data first to obtain the waiting time models fCQ and fSQ. The obtained SVR models are appliedon the new traffic ARCG to predict the flow delay.

Moreover, by using hardware rather than a software program to conduct simulations,an over 100× speedup compared to the Booksim simulator can be achieved withoutdegradation in the final simulation accuracy [Wang et al. 2011].

In [Papamichael et al. 2011], a fast and simple FPGA based simulator (FIST) forevaluating NoC latency is developed. The objective of FIST simulator is to implementan analytical router model in hardware so as to replace the detailed full-system evalu-ations. The main idea of FIST design is to model each router as a lookup table in FPGAwhich corresponds to a load-delay curve identified from the queuing-theory based mod-els. At run time, during every time window, by checking the current load at the inter-mediate router, a delay value can be obtained which determines the waiting time of apacket in the current router buffer before being routed. By following the packet routingpath and adding together the load-dependent delays at each intermediate router, theuser can obtain an end-to-end delay estimation for a specific packet/flow. Comparedto the conventional queuing models, the FIST simulator adjusts the delays based onthe router’s load dynamically. Therefore, it better reflects the temporal behaviors (e.g.,latency variations over different time period) of a traffic flow. Compared to the cycle-based full-system simulation, a 43× speedup and less than 6% error are reported forthe FIST simulator [Papamichael et al. 2011].

5.2. Learning based NoC performance evaluationAs discussed in Section 3, most NoC analytical models are based on queuing theory,which use the M/M/1, M/D/1, M/G/1/K or G/G/1 formula to estimate the channel wait-ing times. These models can achieve high prediction precision if some assumptions aremet, such as the injection process of the packet headers is Poisson [Ogras et al. 2010].

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 30: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:30 Z. Qian et al.

However, for the cases where the application or the NoC configuration does not sat-isfy these assumptions, the accuracy of the analytical model degrades. To address suchproblem, in [Qian et al. 2013; 2015], SVR-NoC, a support vector regression based la-tency model is proposed for NoC latency prediction. The rationale of the SVR-NoC isto refine the analytical models by applying learning techniques on the collected simu-lation samples. To be more specific, the workflow of the SVR-NoC can be summarizedas follows (shown in Fig. 10) [Qian et al. 2013; 2015]: First, based on the NoC ar-chitecture on which the application is mapped and executed, several synthetic trafficpatterns are fed into a corresponding simulator to collect training data samples. Thenon-parametric support vector regression [Vapnik 1998] is then applied on the col-lected training data to estimate the relationship between the waiting times and theirfeatures. In the learning process, the users need to collect two sets of features, i.e., XCQ

and XSQ which correspond to the features of the waiting times at the buffer channels(i.e., the channel waiting delay CQ) and source queues (i.e., the source waiting delaySQ). Of note, the features in the set XCQ and XSQ include the factors that directlyaffect the channel and source waiting times. For example, the arrival rate elementsin XCQ and XSQ reflect the traffic load injected onto the channel, while the forward-ing probability elements in XCQ aim to recognize the potential contentions amongthe flows within the same router. After applying support vector regression (SVR) forlearning, two model functions, namely the channel queuing and source queuing model(fCQ and fSQ) are obtained from the training data. During the training process, in or-der to avoid data-overfitting, cross-validation needs to be used to determine the besthyper-parameter combinations [Vapnik 1998]. Finally, to apply the learning model toevaluate the NoC latency of a new application, the input traffic is first characterizedas an architecture characterization graph (ARCG) discussed in Section 2.1. Then, foreach source channel and intermediate link channels, the corresponding feature vectorsXCQ and XSQ are calculated based on the extracted ARCG. By applying the regressionfunctions (fCQ and fSQ) on the new feature sets, the waiting times at each link chan-nels as well as the overall flow latency can be predicted accordingly.

6. OPEN PROBLEMS AND RESEARCH PERSPECTIVESBased on discussions in the previous sections, several open problems that need to beaddressed are identified and some potential research directions are presented in thefollowing discussions:

— Modeling non-stationary and time-dependent NoC traffic: The major problem ofmemorlyess (i.e., Poisson or Markovian) and short-range dependent traffic models(i.e., GE, 2-state MMPP) is that they cannot capture the time-dependent arrival pro-cesses. For NoCs under light or medium load, the temporal correlations among thepackets are not obvious. Therefore, these models may still provide good approxima-tions. However, under heavy workload (i.e., high packet injection rates [Bogdan andMarculescu 2011]), the contentions for the shared resources introduce a long tail ofthe arrival and service time distribution, which indicates the earlier packets in thenetwork may affect their subsequent ones. The accuracy of delay prediction applyingsuch models is greatly reduced. To overcome this limitation, there are two possible di-rections. The first direction is to generalize over the current Poisson or 2-state MMPPtraffic models. Specifically, in the subject of Internet traffic modeling and operationalresearch, the more general processes such as Batch Markovian Arrival Processes(BMAP) have been used to model the long-range dependency relations [Klemm et al.2002]. The self-similarity and traffic burstiness can be more efficiently character-ized. The second direction is to use the multi-fractal traffic model. Currently, themodel is complicated and requires a set of master equations to be solved. An efficient

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 31: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:31

implementation that can be embedded in the whole synthesis loop is an interestingresearch topic.

— Developing corresponding performance models for multi-fractal traffic: The multi-fractal analysis improves the accuracy over previous models, especially when thesystem works near the saturation point. Moreover, when fixing some of the expo-nent parameters, it has been observed that multi-fractal model reduces to the pre-vious memoryless or short-range dependent models. In conclusion, the multi-fractalanalysis framework provides a more general solution in both the traffic analysis andperformance prediction. However, the complexity of solving master equations is muchhigher than other models. Also, current research efforts only focused on the workloadcharacterization. The performance evaluation model (e.g., models for average-case,worst-case or statistical metrics) based on multi-fractal input is still rarely explored.One research direction is to consider solving the multi-fractal formalism presentedin [Bogdan 2015] and derive the probability of large (rare) events in the formalism.Then, these probabilities could be used to compute the real-time performance metricssuch as the delay bounds.

— Relaxing assumptions in current queuing models: The common limitation of analyt-ical queuing models is they rely on different assumptions that are made before thederivation. The accuracy and usage is constrained by how close the real cases matchwith the assumptions. For the average-case prediction, as summarized in Table III,each queueing model is built based on its specific traffic input and router architecture(e.g., buffer size, arbitration policy). The accuracy degrades if the model is applied to avery different situation. Recently, some generalization efforts have been made, suchas assuming general independent arrival and service processes, removing the as-sumptions on the buffer depths and packet sizes. However, these models still do notcapture the time-dependency among the inter-arrival and service processes. There-fore, one future direction is to extend the queueing model with long-range dependentarrival and service processes while relaxing the assumptions on the topology, buffersize and arbitration scheme at the same time.

— Transient analysis for Network-on-Chips: Most analytical queuing models only focuson the aggregated average performance (e.g., mean delay). However, the authors in[Ohmann et al. 2014] pointed out for non-stationary applications, since the system isunstable, transient behavioral analysis is more useful. In [Ohmann et al. 2014], theyalso propose a transient queueing model for a single router element. They discussed,for this emerging direction, additional efforts are required to provide a working modelfor a router network. Moreover, the methods of applying transient analysis for betterflow control or buffer sizing are also desired [Bogdan 2015].

— Efficient resources management strategies: The NoC performance models have beenwidely used in buffer sizing [Du et al. 2014] and flow regulation [Jafari et al. 2010].However, if the workload is time-dependent, the configuration decision made maybe optimistic and not satisfy the QoS requirement [Varatkar and Marculescu 2004].This limits the usage of such models for some real-time systems. On the other hand,if a multi-fractal traffic model is used, it requires to develop an efficient hardwareimplementation which can support collecting the information and re-acting to thenetwork congestions in time.

— Developing efficient approaches for modeling NoC resources contentions in real-timesystems: In NoC, the resources (e.g., link channels) are shared among different flows.Therefore, it has been realized in research that the contentions are the major causesof delay variations. How to efficiently model the effects of resources sharing is onekey problem. Especially for real time systems with QoS requirement, over-estimatingcontention delays introduces additional resources overhead while under-estimatingthe delays causes a potential system failure. In the worst-case analytical models,

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 32: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:32 Z. Qian et al.

the real bound analysis makes a conservative assumption that the target flow al-ways loses arbitration. The bound tightness is compromised. The network calculusbounds depend on whether the arrival and service curves can capture the real-timeor statistical behaviors. How to model a network with more complicated contentionscenarios still needs to be explored. For the schedulability and data flow analysis,they make assumptions on the priority assignment of the traffic flows. Different VCsin the same port may be allocated to different priority levels. Therefore, they aremostly suitable for priority-based architectures. For best-effort NoC or real time NoCwith soft-deadline requirements, these two approaches cannot be directly applied.

— Addressing scalability challenge in NoC simulators design: For a large-scale systemwith more than hundreds of cores, the full-system simulation is unaffordable. Evenwith current parallel simulation techniques, how to efficiently partition the simula-tions among multiple threads and address the synchronization barrier is still diffi-cult. Statistical sampling techniques help to reduce the simulation points. However,the determination of sampling strategy is application-specific which relies on the pre-analysis of the application characteristics. Another direction is to use hardware basedmodeling or learning based techniques to accelerate the evaluation. For the learningbased modeling approach, the current model only works for the average latency pre-diction. How to reduce the training data size and extend the learning method formore performance metrics (e.g., worst-case or statistical delay) remain open to thewhole community.

7. CONCLUSIONSIn this survey, we have reviewed several techniques for NoC performance evaluation.These techniques range from building the traffic models to developing the analyticaland simulation-based latency models. We first summarized the typical workloads thatare employed in NoC-based system evaluation. Then, we have reviewed the approachesin traffic analysis for capturing both the short- and long-range dependence behaviors.In addition, a multi-fractal based approach to model the non-stationary traffic is in-troduced. We concluded the traffic analysis techniques by summarizing the featuresof each traffic model. For the NoC latency performance evaluation, we have presentedthe techniques to predict the average-case and worst-case delay. The simulation-basedevaluation models are also reviewed. We have also reviewed the state-of-the-art pro-gresses that combines the advantages in analytical models and simulations to providerapid and scalable performance evaluations. The performance evaluation for NoC-based multicore system still faces many interesting challenges. We have discussedseveral open problems and potential research directions towards this purpose.

ACKNOWLEDGEMENTThe authors would like to thank the reviewers for their detailed and valuable com-ments and suggestions.

REFERENCESC. Ababei, P. P. Pande, and S. Pasricha. 2012. Network-on-chips (NoC) Blog. (2012). http://networkonchip.

wordpress.com/P. Abad, P. Prieto, L. G. Menezo, A. Colaso, V. Puente, and J. A. Gregorio. 2012. TOPAZ: An Open-Source

Interconnection Network Simulator for Chip Multiprocessors and Supercomputers. In Networks on Chip(NoCS), 2012 Sixth IEEE/ACM International Symposium on. 99–106.

N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha. 2009. GARNET: A detailed on-chip network model insidea full-system simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEEInternational Symposium on. 33 –42.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 33: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:33

E.K. Ardestani and J. Renau. 2013. ESESC: A fast multicore simulator using Time-Based Sampling. InHigh Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on.448–459.

M. Arjomand and H. Sarbazi-Azad. 2009. A comprehensive power-performance model for NoCs with multi-flit channel buffers. In Proceedings of the 23rd international conference on Supercomputing (ICS ’09).ACM, New York, NY, USA, 470–478.

M. Arjomand and H. Sarbazi-Azad. 2010. Power-Performance Analysis of Networks-on-Chip With ArbitraryBuffer Allocation Schemes. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-tions on 29, 10 (Oct 2010), 1558–1571.

Atlas. 2011. Atlas environment for Network-on-Chips. (2011). http://corfu.pucrs.br/redmine/projects/atlas/wiki

Marco Bekooij, Orlando Moreira, O Moreira, Peter Poplavko, Bart Mesman, Milan Pastrnak, and Jef VanMeerbergen. 2004. Predictable Embedded Multiprocessor System Design. In In Proc. InternationalWorkshop on Software and Compilers for Embedded Systems (SCOPES), LNCS 3199. Springer.

Y. Ben-Itzhak, I. Cidon, and A. Kolodny. 2011. Delay analysis of wormhole based heterogeneous NoC. InNetworks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on. 161–168.

Y. Ben-Itzhak, E. Zahavi, I. Cidon, and A. Kolodny. 2012. HNOCS: Modular open-source simulator for Het-erogeneous NoCs. In Embedded Computer Systems (SAMOS), 2012 International Conference on. 51–57.

L. Benini and G. De Micheli. 2002. Networks on chips: a new SoC paradigm. Computer 35, 1 (jan 2002), 70–78.

D. Bertozzi and L. Benini. 2004. Xpipes: a network-on-chip architecture for gigascale systems-on-chip. Cir-cuits and Systems Magazine, IEEE 4, 2 (2004), 18–31.

D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli. 2005. NoCsynthesis flow for customized domain specific multiprocessor systems-on-chip. Parallel and DistributedSystems, IEEE Transactions on 16, 2 (2005), 113–129.

Paul Bogdan. 2015. Mathematical Modeling and Control of Multifractal Workloads for Data-Center-on-a-Chip Optimization. In Networks-on-Chip (NoCS), 2015 Ninth IEEE/ACM International Symposium on.

P. Bogdan, M. Kas, R. Marculescu, and O. Mutlu. 2010. QuaLe: A Quantum-Leap Inspired Model for Non-stationary Analysis of NoC Traffic in Chip Multi-processors. In Networks-on-Chip (NOCS), 2010 FourthACM/IEEE International Symposium on. 241–248.

P. Bogdan and R. Marculescu. 2009. Statistical physics approaches for network-on-chip traffic characteri-zation. In Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesignand system synthesis (CODES+ISSS ’09). ACM, New York, NY, USA, 461–470.

P. Bogdan and R. Marculescu. 2010. Workload characterization and its impact on multicore platform de-sign. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010 IEEE/ACM/IFIPInternational Conference on. 231–240.

P. Bogdan and R. Marculescu. 2011. Non-Stationary Traffic Analysis and Its Implications on MulticorePlatform Design. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 30,4 (2011), 508–519.

Paul Bogdan and Yuankun Xue. 2015. Mathematical Models and Control Algorithms for Dynamic Optimiza-tion of Multicore Platforms: A Complex Dynamic Approach. In Computer-Aided Design, 2015. ICCAD2015. IEEE/ACM International Conference on.

S. Borkar. 2009. Design perspectives on 22nm CMOS and beyond. In Design Automation Conference, 2009.DAC ’09. 46th ACM/IEEE. 93–94.

J. Y. Le Boudec and P. Thiran. 2004. Network Calculus: A Theory of Deterministic Queuing Systems for theInternet. Lecture Notes in Computer Science,Springer-Verlag, Berlin, Germany.

T.E. Carlson, W. Heirman, and L. Eeckhout. 2013. Sampled simulation of multi-threaded applications. InPerformance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. 2–12.

G. Casale, E. Z. Zhang, and E. Smirni. 2008. KPC-Toolbox: Simple Yet Effective Trace Fitting Using Marko-vian Arrival Processes. In Quantitative Evaluation of Systems, 2008. QEST ’08. Fifth InternationalConference on. 83–92.

S. Chakraborty, S. Kunzli, and L. Thiele. 2003. A general framework for analysing system properties inplatform-based embedded system designs. In Design, Automation and Test in Europe Conference andExhibition, 2003. 190–195.

C. S. Chang. 2000. Performance Guarantees in Communication Networks. Springer-Verlag, New York.S. Chatterjee, M. Kishinevsky, and U.Y. Ogras. 2012. xMAS: Quick Formal Modeling of Communication

Fabrics to Enable Verification. Design Test of Computers, IEEE 29, 3 (2012), 80–88.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 34: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:34 Z. Qian et al.

Connect. 2011. Configurable NEtwork Creation Tool. (2011). http://users.ece.cmu.edu/~mpapamic/connect/Wenbo Dai and N.E. Jerger. 2014. Sampling-based approaches to accelerate network-on-chip simulation. In

Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. 41–48.W. Dally. 1992. Virtual-channel flow control. Parallel and Distributed Systems, IEEE Transactions on 3, 2

(1992), 194–205.W.J. Dally and B. Towles. 2001. Route packets, not wires: on-chip interconnection networks. In Design Au-

tomation Conference, 2001. Proceedings. 684–689.W. Dally and B. Towles. 2003. Principles and Practices of Interconnect Networks. Morgan Kaufmann, San

Francisco, CA.J. Diamond and A. Alfa. 2000. On approximating higher-order MAPs with MAPs of order two. Queueing

systems 34 (2000), 269–288.G. Donald and C. M. Harris. 2008. Fundamentals of Queueing Theory. Wiley.Gaoming Du, Miao Li, Zhonghai Lu, Minglun Gao, and Chunhua Wang. 2014. An analytical model for worst-

case reorder buffer size of multi-path minimal routing NoCs. In Networks-on-Chip (NoCS), 2014 EighthIEEE/ACM International Symposium on. 49–56.

M. Eggenberger and M. Radetzki. 2013. Scalable parallel simulation of networks on chip. In Networks onChip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. 1–8.

E. Fischer and G.P. Fettweis. 2013. An accurate and scalable analytic model for round-robin arbitration innetwork-on-chip. In Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on.1–8.

W. Fischer and K. Meier-Hellstern. 1993. The Markov-modulated poisson process (MMPP) cookbook. Perfor-mance Evaluation, Elsevier 18, 2 (1993), 149–171.

Jose Flich and Davide Bertozzi (Eds.). 2010. Designing Network On-Chip Architectures in the Nanoscale Era.Chapman and Hall/CRC.

S. Foroutan, Y. Thonnart, and F. Petrot. 2013. An Iterative Computational Technique for Performance Eval-uation of Networks-on-Chip. Computers, IEEE Transactions on 62, 8 (Aug 2013), 1641–1655.

H. de Meer G. Bolch, S. Greiner and K. S. Trivedi. 2006. Queueing Networks and Markov Chains: Modelingand Performance Evaluation with Computer Science Applications,2nd Edition. John Wiley and Sons.

Gem5. 2009. Gem5 simulator. (2009). http://www.m5sim.org/N. Genko, D. Atienza, G. De Micheli, J. M. Mendias, R. Hermida, and F. Catthoor. 2005. A Complete Network-

On-Chip Emulation Framework. In Proceedings of the conference on Design, Automation and Test inEurope - Volume 1 (DATE ’05). 246–251.

gMemNoCsim. 2011. gMemNoCsim simulator. (2011). http://www.gap.upv.es/index.php?option=com_content&view=article&id=72&Itemid=108

Graphite. 2010. Graphite simulator. (2010). http://groups.csail.mit.edu/carbon/P. Gratz and S. W. Keckler. 2010. Realistic Workload Characterization and Analysis for Networks-on-Chip

Design. In The 4th Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI).Cristian Grecu, Andre Ivanov, Partha P, Axel Jantsch, Erno Salminen, and Radu Marculescu. 2007. An

Initiative towards Open Network-on-Chip Benchmarks.Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. 2007. Network Delays and Link Capacities

in Application-Specific Wormhole NoCs. VLSI Design (2007).A. Hansson, M. Wiggers, A. Moonen, K. Goossens, and M. Bekooij. 2008. Applying Dataflow Analysis to

Dimension Buffers for Guaranteed Performance in Networks on Chip. In Networks-on-Chip, 2008. NoCS2008. Second ACM/IEEE International Symposium on. 211–212.

J. Hestness and S. W. Keckler. 2010. Netrace: Dependency-Driven, Trace-Based Network-on-Chip Simula-tion". In 3rd International Workshop on Network on Chip Architectures (NoCArc).

HNoC. 2013. HNoC simulator. (2013). http://hnocs.eew.technion.ac.il/H. Hossain, M. Ahmed, A. Al-Nayeem, T.Z. Islam, and M.M. Akbar. 2007. Gpnocsim - A General Purpose

Simulator for Network-On-Chip. In Information and Communication Technology, 2007. ICICT ’07. In-ternational Conference on. 254–257.

Jingcao Hu and R. Marculescu. 2003. Energy-aware mapping for tile-based NoC architectures under per-formance constraints. In Design Automation Conference, 2003. Proceedings of the ASP-DAC 2003. Asiaand South Pacific. 233–239.

J. Hu and R. Marculescu. 2004a. Application-specific buffer space allocation for networks-on-chip routerdesign. In Computer Aided Design, 2004. ICCAD-2004. IEEE/ACM International Conference on. 354–361.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 35: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:35

J. Hu and R. Marculescu. 2004b. Energy-aware communication and task scheduling for network-on-chiparchitectures under real-time constraints. In Design, Automation and Test in Europe Conference andExhibition, 2004. Proceedings, Vol. 1. 234–239 Vol.1.

Jingcao Hu and R. Marculescu. 2005. Energy- and performance-aware mapping for regular NoC architec-tures. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 24, 4 (April2005), 551–562.

P. C. Hu and L. Kleinrock. 1997. An Analytical Model for Wormhole Routing with Finite Size Input Buffers.In 15th International Teletraffic Congress,University of California, Los Angeles.

E.A.F. Ihlen. 2012. Introduction to Multifractal Detrended Fluctuation Analysis in Matlab. Frontiers inPhysiology 3 (2012), 141.

F. Jafari, Z. Lu, A. Jantsch, and M. H. Yaghmaee. 2010. Buffer Optimization in Network-on-Chip ThroughFlow Regulation. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29,12 (2010), 1973–1986.

J.Beran. 1994. Statistics for long-memory processes. Chapman and Hall.Nan Jiang, D.U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D.E. Shaw, J. Kim, and W.J. Dally. 2013.

A detailed and flexible cycle-accurate Network-on-Chip simulator. In Performance Analysis of Systemsand Software (ISPASS), 2013 IEEE International Symposium on. 86–96.

Y. Jiang and Y. Liu. 2008. Stochastic Network Calculus. Springer-Verlag, London, UK.A. B. Kahng, B. Li, L. S. Peh, and K. Samadi. 2012. ORION 2.0: A Power-Area Simulator for Interconnection

Networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 20, 1 (Jan. 2012), 191–196.

J. W. Kantelhardt, S. A. Zschiegner, E. Koscielny-Bunde, S. Havlin, A. Bunde, and H. E. Stanley. 2002. Mul-tifractal detrended fluctuation analysis of nonstationary time series. Physica A Statistical Mechanicsand its Applications 316 (Dec. 2002), 87–114.

A. E. Kiasari, A. Jantsch, and Z. Lu. 2013a. Mathematical formalisms for performance evaluation ofnetworks-on-chip. ACM Comput. Surv. 45, 3, Article 38 (July 2013), 41 pages.

A. E. Kiasari, Z. Lu, and A. Jantsch. 2013b. An Analytical Latency Model for Networks-on-Chip. Very LargeScale Integration (VLSI) Systems, IEEE Transactions on 21, 1 (jan. 2013), 113 –123.

A. E. Kiasari, D. Rahmati, H. Sarbazi-Azad, and S. Hessabi. 2008. A Markovian Performance Model forNetworks-on-Chip. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Eu-romicro Conference on. 157–164.

L. Kleinrock. 1975. Queueing Systems, Volume I: Theory. Wiley.A. Klemm, C. Lindemann, and M. Lohmann. 2002. Traffic Modeling of IP Networks Using the Batch Marko-

vian Arrival Process. In Computer Performance Evaluation: Modelling Techniques and Tools, Tony Field,PeterG. Harrison, Jeremy Bradley, and Uli Harder (Eds.). Lecture Notes in Computer Science, Vol. 2324.Springer Berlin Heidelberg, 92–110.

Hisashi Kobayashi. 1974. Application of the Diffusion Approximation to Queueing Networks I: EquilibriumQueue Distributions. J. ACM 21, 2 (April 1974), 316–328.

D. D. Kouvatsos, A. SALAM, and M. Ould-Khaoua. 2005. Performance modeling of wormhole-routed hyper-cubes with bursty traffic and finite buffers. Int J Simul Pract Syst Sci Technol 6 (2005), 69–81.

P. J. Kuhn. 2013. Tutorial on Queuing Theory. University of Stuttgart.M. C. Lai, L. Gao, N. Xiao, and Z. Y. Wang. 2009. An accurate and efficient performance analysis approach

based on queuing model for network on chip. In Computer-Aided Design - Digest of Technical Papers,2009. ICCAD 2009. IEEE/ACM International Conference on. 563 –570.

S. Lee. 2003. Real-time wormhole channels. Journal Of Parallel And Distributed Computing 63 (2003), 299–311.

M. Lis, P. Ren, M. H. Cho, K. S. Shim, C. W. Fletcher, O. Khan, and S. Devadas. 2011. Scalable, accuratemulticore simulation in the 1000-core era. In Performance Analysis of Systems and Software (ISPASS),2011 IEEE International Symposium on. 175–185.

Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Pengju Ren, Omer Khan, and Srinivas Devadas. 2010.DARSIM: a parallel cycle-level NoC simulator. In MoBS 2010 - Sixth Annual Workshop on Model-ing, Benchmarking and Simulation, Lieven Eeckhout and Thomas Wenisch (Eds.). Saint Malo, France.https://hal.inria.fr/inria-00492982

W. Liu, J. Xu, X. Wu, Y. Ye, X. Wang, W. Zhang, M. Nikdast, and Z. Wang. 2011. A NoC Traffic Suite Basedon Real Applications. In VLSI (ISVLSI), 2011 IEEE Computer Society Annual Symposium on. 66–71.

Mario Lodde and Jose Flich. 2012. Memory Hierarchy and Network Co-design through Trace-Driven Simu-lation. In In Proc. of 7th International Summer School on Advanced Computer Architecture and Compi-lation for High-Performance and Embedded Systems.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 36: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:36 Z. Qian et al.

R. Lopes and N. Betrouni. 2009. Fractal and multifractal analysis: A review. Medical Image Analysis 13, 4(2009), 634 – 649.

Zhonghai Lu, Rikard Thid, Mikael Millberg, Erland Nilsson, and Axel Jantsch. 2005. NNSE: NostrumNetwork-on-Chip Simulation Environment. In Swedish System-on-Chip Conference (SSoCC). 1–4.

Zhonghai Lu, Yuan Yao, and Yuming Jiang. 2014. Towards stochastic delay bound analysis for Network-on-Chip. In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. 64–71.

O. Lysne. 1998. Towards a Generic Analytical Model of Wormhole Routing Networks. microprocessors andmicrosystems 21, 7-8 (1998), 491–498.

Ian R. Mackintosh. 2008. OCP-IP NoC Benchmarking WG activities. Design Test of Computers, IEEE 25, 5(Sept 2008), 504–504.

S. Mahadevan, F. Angiolini, M. Storoaard, R. G. Olsen, J. Sparsoe, and J. Madsen. 2005. Network trafficgenerator model for fast network-on-chip simulation. In Design, Automation and Test in Europe, 2005.Proceedings. 780–785 Vol. 2.

R. Marculescu and P. Bogdan. 2009. The Chip Is the Network: Toward a Science of Network-on-Chip Design.Foundations and Trends in Electronic Design Automation 2, 4 (2009), 371–461.

G. Min and M. Ould-Khaoua. 2004. A performance model for wormhole-switched interconnection networksunder self-similar traffic. Computers, IEEE Transactions on 53, 5 (2004), 601–613.

A. Nayebi, S. Meraji, A. Shamaei, and H. Sarbazi-Azad. 2007. XMulator: A Listener-Based Integrated Sim-ulation Platform for Interconnection Networks. In Modelling Simulation, 2007. AMS ’07. First AsiaInternational Conference on. 128–132.

Netmaker. 2009. Netmaker interconnection networks simulator. (2009). http://www-dyn.cl.cam.ac.uk/~rdm34/wiki/index.php?title=Main_Page

N. Nikitin and J. Cortadella. 2009. A performance analytical model for Network-on-Chip with constantservice time routers. In Computer-Aided Design - Digest of Technical Papers, 2009. ICCAD 2009.IEEE/ACM International Conference on. 571–578.

NIRGAM. 2007. NIRGAM simulator. (2007). http://nirgam.ecs.soton.ac.uk/home.phpNoCbench. 2011. NoCbench. (2011). http://www.tkt.cs.tut.fi/research/nocbench/index.htmlNoxim. 2011. Noxim simulator. (2011). http://noxim.sourceforge.net/OCCN. 2003. OCCN modeling framework. (2003). http://occn.sourceforge.net/U. Y. Ogras, P. Bogdan, and R. Marculescu. 2010. An Analytical Approach for Network-on-Chip Perfor-

mance Analysis. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29,12 (2010), 2001–2013.

D. Ohmann, E. Fischer, and G. Fettweis. 2014. Transient queuing models for input-buffered routers inNetwork-on-Chip. In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on.57–63.

M. Ould-Khaoua. 1999. A performance model for Duato’s fully adaptive routing algorithm in k-ary n-cubes.Computers, IEEE Transactions on 48, 12 (1999), 1297–1304.

Michael K. Papamichael and James C. Hoe. 2012. CONNECT: Re-examining Conventional Wisdom for De-signing Nocs in the Context of FPGAs. In Proceedings of the ACM/SIGDA International Symposium onField Programmable Gate Arrays (FPGA ’12). 37–46.

M. K. Papamichael, J. C. Hoe, and O. Mutlu. 2011. FIST: A fast, lightweight, FPGA-friendly packet la-tency estimator for NoC modeling in full-system simulations. In Networks on Chip (NoCS), 2011 FifthIEEE/ACM International Symposium on. 137–144.

K. Park and W. Willinger. 2000. Self-similar network traffic and performance evaluation. John Wiley andSons, New York.

PARSEC. 2009. PARSEC Benchmark Suite. (2009). http://parsec.cs.princeton.edu/V. Paxson. 1997. Fast, approximate synthesis of fractional gaussian noise for generating self-similar network

traffic. Computer Communication Review 27 (1997), 5–18.Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined

Routers. In Proceedings of the 7th International Symposium on High-Performance Computer Architec-ture (HPCA ’01). IEEE Computer Society, Washington, DC, USA, 255–.

Physionet. 2004. A Brief Overview of Multifractal Time Series. (2004). http://www.physionet.org/tutorials/multifractal/index.shtml

C. Pinto, S. Raghav, A. Marongiu, M. Ruggiero, D. Atienza, and L. Benini. 2011. GPGPU-Accelerated Paralleland Fast Simulation of Thousand-Core Platforms. In Cluster, Cloud and Grid Computing (CCGrid),2011 11th IEEE/ACM International Symposium on. 53–62.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 37: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency ModellingA:37

Subodh Prabhu. 2010. OCIN_TSIM-a DVFS aware simulator for NoC design space exploration and opti-mization. M.Sc. thesis. Texas A&M University.

V. Puente, J.A. Gregorio, and R. Beivide. 2002. SICOSYS: an integrated framework for studying intercon-nection network performance in multiprocessor systems. In Parallel, Distributed and Network-basedProcessing, 2002. Proceedings. 10th Euromicro Workshop on. 15–22.

A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L. Raffo, G. De Micheli, and L. Benini. 2007. NoCDesign and Implementation in 65nm Technology. In Networks-on-Chip, 2007. NOCS 2007. First Inter-national Symposium on. 273–282.

Bo Qian and Khaled Rasheed. 2004. HURST EXPONENT AND FINANCIAL MARKET PREDICTABILITY.(2004).

Y. Qian, Z. Lu, and Q. Dou. 2010a. QoS scheduling for NoCs: Strict Priority Queueing versus WeightedRound Robin. In Computer Design (ICCD), 2010 IEEE International Conference on. 52–59.

Y. Qian, Z. Lu, and W. Dou. 2009a. Analysis of communication delay bounds for network on chips. In DesignAutomation Conference, 2009. ASP-DAC 2009. Asia and South Pacific. 7–12.

Y. Qian, Z. Lu, and W. Dou. 2009b. Analysis of worst-case delay bounds for best-effort communication inwormhole networks on chip. In Networks-on-Chip, 2009. NoCS 2009. 3rd ACM/IEEE InternationalSymposium on. 44–53.

Y. Qian, Z. Lu, and W. Dou. 2009c. Applying network calculus for performance analysis of self-similartraffic in on-chip networks. In Proceedings of the 7th IEEE/ACM international conference on Hard-ware/software codesign and system synthesis (CODES+ISSS ’09). ACM, New York, NY, USA, 453–460.

Y. Qian, Z. Lu, and W. Dou. 2010b. Analysis of Worst-Case Delay Bounds for On-Chip Packet-Switching Net-works. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29, 5 (2010),802–815.

Z. Qian, D. Juan, P. Bogdan, C. Tsui, D. Marculescu, and R. Marculescu. 2015. A Support Vector Regres-sion (SVR) based Latency Model for Network-on-Chip (NoC) Architectures. Computer-Aided Design ofIntegrated Circuits and Systems, IEEE Transactions on PP, 99 (2015), 1–1.

Zhiliang Qian, Da-Cheng Juan, P. Bogdan, Chi ying Tsui, D. Marculescu, and R. Marculescu. 2014. A com-prehensive and accurate latency model for Network-on-Chip performance analysis. In Design Automa-tion Conference (ASP-DAC), 2014 19th Asia and South Pacific. 323–328.

Z. L. Qian, D. C. Juan, P. Bogdan, C. Y. Tsui, D. Marculescu, and R. Marculescu. 2013. SVR-NoC: A Perfor-mance Analysis Tool for Network-on-Chips Using Learning-Based Support Vector Regression Model. InACM/IEEE Design Automation and Test in Europe (DATE).

D. Rahmati, S. Murali, L. Benini, F. Angiolini, G. De Micheli, and H. Sarbazi-Azad. 2009. A method forcalculating hard QoS guarantees for Networks-on-Chip. In Computer-Aided Design - Digest of TechnicalPapers, 2009. ICCAD 2009. IEEE/ACM International Conference on. 579–586.

D. Rahmati, S. Murali, L. Benini, F. Angiolini, G. De Micheli, and H. Sarbazi-Azad. 2013. Computing Ac-curate Performance Bounds for Best Effort Networks-on-Chip. Computers, IEEE Transactions on 62, 3(March 2013), 452–467.

F.J. Ridruejo Perez and J. Miguel-Alonso. 2005. INSEE: An Interconnection Network Simulation and Eval-uation Environment. In Euro-Par 2005 Parallel Processing. Lecture Notes in Computer Science, Vol.3648. Springer Berlin Heidelberg, 1014–1023.

B. Ryu and S. Lowen. 2000. Fractal traffic models for Internet simulation. In Computers and Communica-tions, 2000. Proceedings. ISCC 2000. Fifth IEEE Symposium on. 200–206.

S. H. Shahram and L. N. Tho. 1998. Multiple-state MMPP models for multimedia ATM traffic. In Internat.Conf. on Telecommunications (ICT’98)Proceedings. 435–439.

S. H. Shahram and L. N. Tho. 2000. MMPP models for multimedia traffic. Telecommunication Systems 15,3-4 (2000), 273–293.

Zheng Shi and A. Burns. 2008. Real-Time Communication Analysis for On-Chip Networks with WormholeSwitching. In Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on.161–170.

Zheng Shi and Alan Burns. 2010. Schedulability analysis and task mapping for real-time on-chip communi-cation. Real-Time Systems 46, 3 (2010), 360–385.

Simics. 2012. Simics simulator. (2012). http://www.virtutech.com/V. Soteriou, H. S. Wang, and L. S. Peh. 2006. A Statistical Traffic Model for On-Chip Interconnection Net-

works. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006. MAS-COTS 2006. 14th IEEE International Symposium on. 104–116.

SPLASH-2. 1995. SPLASH-2 Benchmark Suite. (1995). http://www.capsl.udel.edu/splash/

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.

Page 38: A Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis …sld/publications/performanc… ·  · 2016-01-26Performance Evaluation of NoC-based Multicore Systems:

A:38 Z. Qian et al.

C. D. Spradling. 2007. SPEC CPU2006 Benchmark Tools. SIGARCH Computer Architecture News 35 (March2007). Issue 1.

S. Stergiou, F. Angiolini, Salvatore Carta, L. Raffo, D. Bertozzi, and G. De Micheli. 2005. times;pipes Lite:a synthesis oriented design library for networks on chips. In Design, Automation and Test in Europe,2005. Proceedings. 1188–1193 Vol. 2. DOI:http://dx.doi.org/10.1109/DATE.2005.1

L. Thiele, S. Chakraborty, and M. Naedele. 2000. Real-time calculus for scheduling hard real-time systems.In Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Sympo-sium on, Vol. 4. 101–104 vol.4.

Anh T. Tran and Bevan M. Baas. 2012. NoCTweak: a Highly Parameterizable Simulator for Early Explo-ration of Performance and Energy of Networks On Chip.

V. Vapnik. 1998. Statistical Learning theory. John Wiley and Sons.G. Varatkar and R. Marculescu. 2002. Traffic analysis for on-chip networks design of multimedia applica-

tions. In Design Automation Conference, 2002. Proceedings. 39th. 795–800.G. V. Varatkar and R. Marculescu. 2004. On-chip traffic modeling and synthesis for MPEG-2 video applica-

tions. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 12, 1 (2004), 108–119.D. Y. Wang, N. E. Jerger, and J. G. Steffan. 2011. DART: A programmable architecture for NoC simulation

on FPGAs. In Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on. 145 –152.Zhe Wang, Weichen Liu, Jiang Xu, Bin Li, R. Iyer, R. Illikkal, Xiaowen Wu, Wai Ho Mow, and Wenjing Ye.

2014. A Case Study on the Communication and Computation Behaviors of Real Applications in NoC-Based MPSoCs. In VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on. 480–485.

T.F. Wenisch, R.E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J.C. Hoe. 2006. SimFlex: StatisticalSampling of Computer System Simulation. Micro, IEEE 26, 4 (July 2006), 18–31.

P. T. Wolkotte, P. K. F. Holzenspies, and G. J. M. Smit. 2007. Fast, Accurate and Detailed NoC Simulations.In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 323–332.

Wormsim. 2008. Wormsim simulator. (2008). http://www.ece.cmu.edu/~sld/software/worm_sim.phpY. Wu, G. Min, M. Ould-Khaoua, H. Yin, and L. Wang. 2010. Analytical modelling of networks in multi-

computer systems under bursty and batch arrival traffic. The Journal of Supercomputing 51, 2 (2010),115–130.

T. Yoshihara, S. Kasahara, and Y. Takahashi. 2001. Practical Time-Scale Fitting of Self-Similar Traffic withMarkov-Modulated Poisson Process. (2001).

X. Zhao and Z. Lu. 2013. Per-flow delay bound analysis based on a formalized microarchitectural model. InNetworks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. 1–8.

M. Zolghadr, K. Mirhosseini, S. Gorgin, and A. Nayebi. 2011. GPU-based NoC simulator. In Formal Methodsand Models for Codesign (MEMOCODE), 2011 9th IEEE/ACM International Conference on. 83–88.

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.