

THE INTERPLAY OF CACHES AND THREADS IN CHIP-MULTIPROCESSORS

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Zvika Guz

Submitted to the Senate of the Technion – Israel Institute of Technology

Heshvan 5771    Haifa    October 2010


Acknowledgements

The research thesis was done under the supervision of Prof. Uri C. Weiser, Prof. Idit Keidar, and Prof. Avinoam Kolodny, in the Faculty of Electrical Engineering.

My pursuit of this Ph.D. has been an amazing journey which would have never been fulfilled without the help and support of my family, advisors, colleagues, and friends.

First, I am deeply grateful to my three Ph.D. advisors, Prof. Uri Weiser, Prof. Idit Keidar, and Prof. Avinoam Kolodny. Your uncompromising guidance, wisdom, and encouragement during difficult times have carried me through this journey. I was inspired by your passion for research and for life, and cannot overstate your contribution to my academic and non-academic growth. Deciding to have you as my advisors was, by far, the smartest thing I did in graduate school; thank you.

I thank Dr. Evgeny Bolotin, Prof. Israel Cidon, Prof. Ran Ginosar, Oved Itzhak, Dr. Avi Mendelson, and Dr. Isask'har (Zigi) Walter, for their collaboration and contribution to my research.

I would like to specially thank Prof. Yitzhak (Tsahi) Birk, Dr. Avi Mendelson, and Ronny Ronen, for their continuous support, direction, and good advice during all these years. Their willingness to help and share their wisdom is greatly appreciated.

I have enjoyed the opportunity to work with the members of both the MATRICS research group and Idit's research group, and am grateful for their valuable inputs and fruitful discussions.

I feel very lucky for the many friends I have made during my time at the Technion, and value their encouragement and friendship. Special thanks to Gal Badishi, Aran Bergman, Kirill Dyagilev, Nadav Lavi, and Zigi Walter, for making my years here enjoyable and fun.

Finally, and most importantly, I would like to express my extreme gratitude to my wonderful family. My sister and brothers, Anat, Shachar, and Or, and especially my parents, Mira and Moshe, have supported all my choices and believed in me at every step along the way. Without your long-lasting love and encouragement none of this would have been possible. I am forever indebted to you.

I thank my wife, Liat, for sharing this journey with me. Liat has done everything, from providing feedback on my research and proofreading my papers, to lifting me up in tough times and trying to enforce the right balance between research and life. I owe you the completion of this dissertation, and so much more; I love you.

The generous financial help of the Technion is gratefully acknowledged.


Table of Contents

Abstract
Abbreviations
Chapter 1. Introduction
  1. Research Overview
  2. Summary of Contributions
  3. Additional Research
  4. Research Methods
Chapter 2. The Nahalal Cache Architecture
  1. Introduction
  2. Related Work
  3. Memory Accesses Characterization
  4. Cache Organization and Management
    4.1. The NAHALAL Layout
    4.2. Nahalal Cache Management
  5. Performance Evaluation
    5.1. Methodology
    5.2. Cache Delay and Relative Distance
    5.3. Overall Performance
    5.4. The Effect of Search
  6. Summary
Chapter 3. The Interplay of Caches and Threads in Many-Core Machines
  1. Introduction
  2. Related Work
  3. The Analytical Model
    3.1. Hardware and Workload Model
    3.2. Performance and Power Equations
  4. Performance and Power Curve Study
    4.1. Performance Curve Study
    4.2. Power Efficiency and Performance under a Power Envelope
  5. Performance Simulation
    5.1. Simulator
    5.2. Workload and System Parameters
    5.3. Results
    5.4. Performance as a Function of Hardware Parameters
  6. Summary
Chapter 4. Summary and Future Work
  1. Summary
  2. Future Research Directions
    2.1. Accounting for Energy in the Nahalal Architecture
    2.2. Nahalal Scalability
    2.3. Extending Nahalal's Principles into New Cache Dimensions
    2.4. Enriching the Many-Core Analytical Model and the MTM$im Simulator
    2.5. Eliminating the Performance Valley via Dynamic Morphing
References


List of Figures

Figure 2-1. Nahalal design.
Figure 2-2. Two cache organizations for an 8-way CMP with one L2 memory bank per core.
Figure 2-3. Two cache organizations for an 8-way CMP with a heavily banked (64 banks) L2 cache.
Figure 2-4. Average L2 cache access times for the 64-bank CMP.
Figure 2-5. Breakdown of average distance to shared and private lines for CIM and Nahalal.
Figure 2-6. Runtime speedup over static line placement for both CIM and Nahalal for increasing link delays (in clock cycles).
Figure 2-7. Average search time normalized to CIM_par search.
Figure 2-8. Average search time normalized to CIM_par of Nahalal's parallel search, hybrid search, and sequential search with a predictor.
Figure 2-9. (Delay)·(Number-of-search-transactions) product for Nahalal's sequential, hybrid, and sequential-with-predictor search schemes, relative to CIM_par.
Figure 3-1. Performance vs. number of threads for benchmarks with different cache hit rate functions (Phit(s,n)), increasing from 0 (no cache) to 1 (a perfect cache), rm=0.1.
Figure 3-2. Performance in a limited-BW environment for benchmarks with different percentages of memory instructions (rm), α=7, β=50.
Figure 3-3. Performance curve for the 3 types of workloads.
Figure 3-4. Efficiency metrics: (a) Performance/Power (1/EPI) and (b) Performance²/Power (1/(energy·delay)) of the unified machine, α=7, β=50.
Figure 3-5. Performance under a power envelope of 300 Watts, α=7, β=50.
Figure 3-6. Performance vs. number of available threads as extracted from simulation and as predicted by the analytical model with actual workload parameter values.
Figure 3-7. Performance of the blackscholes workload across machines with different cache sizes.
Figure 3-8. Performance of the blackscholes workload across machines with different memory latencies.
Figure 4-1. A clustered design for many-core CMP, where each cluster is organized in the form of Nahalal.
Figure 4-2. A conceptual performance plot of a unified machine enhanced with dynamic morphing.


List of Tables

Table 2-1. Cache line sharing characteristics
Table 2-2. Access distribution of shared cache lines
Table 2-3. System Parameters
Table 3-1. Hardware Parameters
Table 3-2. Workload Parameters
Table 3-3. Hardware Power Parameters
Table 3-4. Machine Parameters
Table 3-5. Memory instruction ratios (rm) for PARSEC's workloads


Abstract

While on-die memory sub-systems have long been a principal challenge in computer architecture, the shift to Chip Multiprocessors (CMPs) both greatly intensifies the problem and opens the door for new types of solutions. The increasing requirements, the unique phenomena introduced by the multi-processing environment, and the new considerations of multi-threading all call for a fresh look at the cache organization and management problem, rather than a direct mapping of old solutions from the single-core world. In this work we tackle a number of important problems that arise in this context.

In our first contribution we address a new cache organization for multi-core machines (CMPs with a handful of cores). We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. In Nahalal (whose topology was inspired by the layout of the cooperative village Nahalal in Israel), a fraction of the on-die memory capacity budget is used for hot shared data and is located in the center of the chip, enclosed by all processors. The rest of the cache capacity is placed in the outer area of the die, and provides private storage space for each core.

Via characterization of memory access patterns in typical parallel programs, we show that the Nahalal topology is especially appropriate for common multi-threaded applications, as a small subset of the lines in their working set is shared by many cores and is accessed numerous times during the application's lifetime. Detailed simulations exhibit significant improvements in cache access latency compared to a traditional cache design.

In our second contribution we address the interplay of threads and caches in the new generation of many-core machines (CMPs with a large number of simple cores). While these systems now use both large caches and aggressive multi-threading to mitigate the off-chip memory access problem, the combination of the two very different approaches makes performance prediction challenging and hinders the understanding of the basic interplay between workloads and architectures and its effect on performance and power.

To address these challenges, we provide a high-level, closed-form model that captures both the architecture and the application characteristics. Specifically, for a given application, the model describes its performance and power as a function of the number of threads it runs in parallel, on a range of architectures.

We use the analytical model to qualitatively study how different properties of both the workload and the architecture affect performance and power. We study both synthetic and real workloads, backing our analytical model with simulations. Our findings identify distinctly different behavior patterns for different application families and architectures, as well as a non-intuitive "performance valley" in which machines provide inferior performance. We study the shape of this "performance valley" and provide insights on how it can be avoided.


Abbreviations

AMAT Average Memory Access Time

CIM Cache In the Middle

CMP Chip Multi-Processors

DNUCA Dynamic Non Uniform Cache Architecture

EPI Energy Per Instruction

GPGPU General-Purpose Graphics Processing Units

MRU Most Recently Used

LLC Last Level Cache

LRU Least Recently Used

NoC Network on Chip

NUCA Non Uniform Cache Architecture

OPS Operations Per Second

SNUCA Static Non Uniform Cache Architecture


Chapter 1. Introduction

In this chapter we provide an overview of our research framework, describe our publications, and briefly introduce related work. An in-depth review of related work, set in the context of each of our main contributions, is given later in the appropriate chapters. Likewise, we overview the main research methods used throughout this dissertation and direct the reader to their in-depth description in the relevant chapters.

1. Research Overview

Chip Multi-Processors (CMPs) [65] are nowadays mainstream, leveraging the parallelism of multi-threaded applications to achieve higher performance within a given power envelope. In a CMP, several processors are integrated on a single chip, usually sharing part of the on-die memory resources and the ports to off-chip memory and peripherals.

Two major trends have helped CMPs become mainstream. Firstly, the transistor budget is now large enough to enable the implementation of several processors within a single chip. Secondly, due to the growing complexity of today's microprocessors, there is a diminishing return in terms of power and area when trying to improve uniprocessor performance [20] [36]. In this respect, embedding several processors within a single chip is more cost-effective, as it harnesses the parallelism across different threads to gain performance, allowing the cores themselves to remain relatively simple and power-efficient.

While the potential for a performance boost in CMP is noteworthy, these systems present several design challenges, among which the on-chip cache memory is of first precedence. Firstly, CMPs severely stress the on-die memory subsystems, as multiple threads compete for limited on-die memory resources: caches now need to cater for the larger storage capacity and bandwidth requirements that result from the increase in the number of on-die cores [68]. Moreover, as the memory wall makes off-chip access prohibitively expensive (both in terms of power and performance) [75], a high cache hit rate is imperative.

Secondly, as global wire delays become a dominant factor in VLSI design [2] [20] [36] [38], not only must caches mitigate the increase in access time, but the access time itself is no longer constant: it depends on the distance between the core that fetches the data and the location of that data within the on-die cache structure [2] [48]. Such a design is called a Non Uniform Cache Architecture (NUCA) [42]. Moreover, since data movement dominates the total energy spent on instruction execution, vicinity of reference is not only crucial for performance, but also imperative for energy efficiency [20] [21]. All these factors make the organization and management of the on-chip cache memories in CMP systems a principal design challenge.

At the same time, the inherent multi-processing environment has profound implications on the underlying hardware in CMP systems. Firstly, multi-threaded applications introduce diverse data usage patterns that were not present in uniprocessors. As we show in Chapter 2, aspects such as inter-thread data sharing and inter-thread communication can dominate the load experienced by the cache, and thus drive the architecture towards completely different solutions than those used in uniprocessors. Moreover, as we show in Chapter 3, highly parallelized workloads enable the use of aggressive multi-threading as an alternative to caches for masking memory latency. The combination of both approaches to memory masking gives rise to new phenomena and introduces complicated design considerations that were not present in the uniprocessor world.

Overall, while caches have been the subject of a plethora of research, the combination of unique hardware constraints and a true on-chip multi-processing environment makes caches in CMP a completely new ballgame in which, more often than not, a direct mapping of existing solutions from uniprocessors will lead to inferior performance. In this work we target two such cases, where the shift to CMP yields new considerations over the uniprocessor era, motivating specifically tailored solutions.

In Chapter 2, we consider the cache structure in multi-core machines, i.e., CMPs with a handful of cores. While uniprocessor microarchitectures typically partition the cache based on content (e.g., instructions versus data cache in the Harvard architecture) and hierarchically (e.g., cache levels: L1, L2, etc.), we argue that in CMPs additional cache dimensions prove valuable, for example, based on data sharing, data coherency, and other CMP characteristics. Specifically, we develop the CMP data sharing paradigm, where we show that it is valuable to distinguish between shared data, which is accessed by multiple cores, and private data, which is accessed by a single core.

Our characterization of memory access patterns in typical parallel programs reveals that while most of the data in the working set is private, there is a shared hot lines effect, where a small subset of the data in the working set is shared by many cores and is accessed numerous times during the application's lifetime. Since in current CMP designs the last-level cache (LLC) is typically placed in bulk at a single location (e.g., at the center of the chip, surrounded by all the processors [8] [9] [16] [17] [42]), the shared hot lines effect severely hinders the overall cache performance: a substantial fraction of the accesses are served from locations that are far away from at least some of the sharers, and hence suffer from long access times.

To mitigate this problem, we introduce Nahalal, a new CMP cache architecture whose novel floorplan partitions cached data according to its usage (shared versus private data) and can thus offer vicinity of reference to both shared and private data. In the Nahalal topology, a small fraction of the on-chip memory is located in the center of the chip and is designated for shared data only. This shared area is enclosed by the cores, while the rest of the on-chip memory is placed on the outer ring and is used for private data blocks. Since all cores are located in the vicinity of the designated shared area, the Nahalal architecture enables fast access to shared data. At the same time, private data is kept close to its client core, thus preserving fast access to private data.

Detailed simulations in Simics demonstrate that Nahalal decreases the shared cache access latency by up to 41.1% compared to a traditional CMP design, yielding performance gains of up to 12.65% in run time. Since Nahalal's principal benefit is in reducing the distance a line needs to traverse on a cache hit, it is also advantageous in terms of energy consumption compared to previously suggested cache designs. Our work on the Nahalal cache architecture, the application characterization, and the performance evaluation was published in [30] [31].

In Chapter 3, we turn to consider many-core machines, i.e., CMPs with a large number of simple cores (a few tens to hundreds), and study the interplay between caches, multi-threading, and parallel workloads via a unified, high-level analytical model.

The new generation of many-core machines employs a combination of graphics-oriented parallel processors with a cache-based architecture. The two approaches first emerged as competing design paradigms targeting the same market of parallel and graphics workloads: Many-Core machines that rely on large caches versus Many-Thread machines that rely on aggressive multi-threading. New systems now combine both approaches for masking off-chip memory accesses, using both large caches and aggressive multi-threading to overcome the memory wall. Intel's Larrabee [71] and Sun's Niagara [28] are prominent examples of Many-Core machines, using large caches to reduce the number of accesses to off-chip memory. Nvidia's GT200 [63] and AMD/ATI's Radeon R700 [4] are true Many-Thread machines, both using aggressive multi-threading to mask memory accesses by running other threads when some are stalled waiting for memory; they manage thousands of in-flight threads concurrently. Lastly, Nvidia's Fermi architecture [64], which combines both a last-level cache and numerous in-flight threads, represents the convergence of the two approaches.

The convergence of the two design paradigms yields a range of throughput-oriented machines along the Many-Core to Many-Thread spectrum. At the same time, the body of applications targeted at such throughput-oriented machines continues to grow: the term GPGPU [27] reflects a broadening of the focus to include not only graphics, but also a wide range of highly-parallel applications. Alas, the range of machines, the combination of two very different approaches for memory masking, and the diverse range of workloads all complicate performance prediction, impede systematic reasoning about tradeoffs, and hinder understanding of the basic interplay between workloads and architectures and its effect on performance and power.

We address these challenges by developing a simple, high-level, closed-form model that captures both the architecture and the application characteristics. The modeled machine uses a parameterized combination of both mechanisms for memory latency masking, and can thus capture a range of machines, rendering the comparison between them meaningful. The workload model, in turn, captures the salient properties of the program, which allows one to predict which architecture is most beneficial for it. All the parameters, capturing both architecture and workload, can be used as "knobs" for studying a wide range of scenarios in order to comprehend the interplay among multiple parameters in a clean, qualitative way. The model can thus serve as a vehicle to derive intuitions.

We use the analytical model to qualitatively study how different properties of both the workload and the architecture affect performance and power. We study both synthetic and real workloads, backing our analytical model with simulations. In our findings we identify distinctly different behavior patterns for different application families and architectures, and show that for some workloads there exists a "performance valley" in which the machine provides inferior performance. Our analytical model, the characterization of the performance valley, and the qualitative performance study were published in [32] [33].
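To make this kind of reasoning concrete, the sketch below implements a toy throughput model in the same spirit. It is only an illustration: the square-root hit-rate curve, the single-issue cores, the hard bandwidth cap, and all numeric values are assumptions chosen here, not the equations or parameters derived in Chapter 3. Even this crude version reproduces the qualitative valley: throughput peaks while the cache still covers the threads' working sets, drops once it no longer does, and only partially recovers when enough threads are available to hide memory latency.

```python
# A deliberately simplified, illustrative throughput model (NOT the thesis's
# actual equations): per-thread cache share, a square-root hit-rate curve,
# latency hiding via extra threads, and a hard off-chip bandwidth cap.

def hit_rate(cache_mb, threads, ws_per_thread_mb=1.0):
    """Assumed hit-rate curve: more threads -> smaller cache share per thread."""
    share = cache_mb / (threads * ws_per_thread_mb)
    return min(1.0, share) ** 0.5

def performance(threads, cores=64, freq_hz=1e9, r_m=0.1, cache_mb=64.0,
                t_mem=200, bw_bytes_s=100e9, line_bytes=64):
    """Aggregate instructions per second for a given software thread count."""
    p_hit = hit_rate(cache_mb, threads)
    stall = r_m * (1.0 - p_hit) * t_mem          # average stall cycles per instruction
    ipc_one_thread = 1.0 / (1.0 + stall)         # a single thread cannot hide its stalls
    # A core overlaps the stalls of its threads; utilization saturates at 1.
    util = min(1.0, (threads / cores) * ipc_one_thread)
    compute_bound = cores * freq_hz * util
    # Every miss moves one cache line across the off-chip interface.
    bytes_per_instr = r_m * (1.0 - p_hit) * line_bytes
    bw_bound = bw_bytes_s / max(bytes_per_instr, 1e-12)
    return min(compute_bound, bw_bound)

if __name__ == "__main__":
    for n in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
        print(f"{n:5d} threads -> {performance(n) / 1e9:6.1f} GIPS")
```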

2. Summary of Contributions

The main contributions of our research are:

- The Nahalal architecture, a novel CMP cache architecture that dynamically differentiates data according to its usage (shared vs. private) [30] [31].

- A simple closed-form model for systematically reasoning about the complex interplay of workload and architecture in the new generation of throughput-oriented many-core engines [32] [33].

- A qualitative study of the effect of various workload and hardware parameters on performance and power in modern many-core machines, and the discovery of the distinct valley-like performance shape [32] [33].


Our work is described in the following research papers:

1. Z. Guz, I. Keidar, A. Kolodny, and U. C. Weiser, "Nahalal: Cache Organization for Chip Multiprocessors", IEEE Computer Architecture Letters, vol. 6, no. 1, May 2007.

2. Z. Guz, I. Keidar, A. Kolodny, and U. Weiser, "Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture", the 20th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'08), special track on Hardware and Software Techniques to Improve the Programmability of Multicore Machines, June 2008.

3. Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE Computer Architecture Letters, vol. 8, no. 4, April 2009.

4. Z. Guz, O. Itzhak, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, "Threads vs. Caches: Modeling the Behavior of Parallel Workloads," the 28th IEEE International Conference on Computer Design (ICCD-2010), October 2010.

3. Additional Research

In other publications, we look at Network-on-Chip (NoC) design and the services that NoCs can provide for CMP machines.

A NoC [23] [29] [60] is an interconnection infrastructure in which different modules are connected by a simple network of shared links and routers, as opposed to dedicated point-to-point connections or a shared bus. Given their superior scaling properties over traditional interconnect schemes [13], NoCs are expected to dominate future CMP design.

Wormhole switching [22] is the common link-level flow control scheme used in NoCs. In wormhole switching, each packet is divided into a sequence of flits, which are transmitted over physical links one by one in a pipelined fashion. A hop-to-hop credit mechanism ensures that a flit is transmitted only when the receiving port has free space in its input buffer.
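As a concrete illustration of this credit mechanism, the sketch below models a single link hop. It is a minimal sketch under assumed parameters (a 4-flit input buffer and a receiver that drains one flit every other cycle); it is not taken from the NoC designs cited above.

```python
from collections import deque

# Minimal sketch of hop-to-hop, credit-based flit flow control on one link.
# Buffer size and the drain timing are illustrative assumptions.

class InputPort:
    def __init__(self, buffer_flits=4):
        self.buffer = deque()
        self.capacity = buffer_flits

    def accept(self, flit):
        self.buffer.append(flit)

    def drain_one(self):
        """Downstream consumes a flit, freeing one buffer slot (one credit)."""
        return self.buffer.popleft() if self.buffer else None

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.capacity     # initially one credit per buffer slot

    def try_send(self, flit):
        if self.credits == 0:                # no free slot downstream: stall
            return False
        self.receiver.accept(flit)
        self.credits -= 1
        return True

    def return_credit(self):
        self.credits += 1                    # receiver freed a slot

# Example: a 6-flit packet over a 4-slot buffer; the receiver drains slowly.
port = InputPort(buffer_flits=4)
tx = Sender(port)
packet = [f"flit{i}" for i in range(6)]
cycle = 0
while packet or port.buffer:
    if packet and tx.try_send(packet[0]):
        packet.pop(0)
    if cycle % 2 == 1 and port.drain_one() is not None:   # drain every 2nd cycle
        tx.return_credit()
    cycle += 1
print(f"delivered all flits in {cycle} cycles")
```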


In [34] [35], we suggest a novel analytical delay model for wormhole-based NoCs with non-uniform link capacities. We then leverage this model to devise an efficient link capacity allocation algorithm, which assigns link capacities such that the packet delay requirements of each flow are satisfied. Our suggested capacity allocation algorithm considerably decreases the cost of the NoC interconnect. It also greatly improves the speed of the customization process, as it eliminates costly simulations in the inner loop of the optimization process.

In [14], we study NoC services for efficient support of shared cache access and cache coherency in CMPs. We first show how a simple, generic NoC, equipped with the needed module interface functionality, can provide the infrastructure for coherent access to both static and dynamic NUCA. We then show how several low-cost mechanisms, based on priority support embedded in the NoC, can facilitate CMP operation and boost the performance of a cache-coherent NUCA CMP.

Since this line of work is not directly related to the core of our research, we do not include it in this dissertation, but rather list it here only for completeness.

5. Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip," in Design, Automation and Test in Europe (DATE), 2006, pp. 9-14.

6. Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Network Delays and Link Capacities in Application-Specific Wormhole NoCs," VLSI Design, vol. 2007, Article ID 90941, 2007.

7. E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny, "The Power of Priority: NoC based Distributed Cache Coherency", in the 1st ACM/IEEE International Symposium on Networks-on-Chip (NOCS), May 2007.

4. Research Methods

This dissertation addresses different aspects related to the interplay of caches and threads in CMP. We address both multi-core and many-core machines with different levels of detail, and consider both current-day machines as well as futuristic, abstracted models. As such, a variety of research methods is used to evaluate the work.

When considering a new cache architecture for multi-core machines, we use the Simics [54] simulator to evaluate the merit of our suggested architecture. Simics is a full-system simulator, capable of booting an unmodified commercial operating system, and thus enables the study of real-world multi-threaded workloads. We have implemented both our suggested Nahalal architecture and a traditional Cache-In-the-Middle (CIM) architecture within Simics, and considered the performance of both architectures over a wide range of workloads. We study scientific workloads (Splash-2 [76], SPEComp [3]) and commercial workloads (apache [6], zeus [78], SPECjbb [73]), as well as two representative software transactional memory benchmarks (from the RSTM kit [55]). More details regarding the workloads and the simulation environment are given in Section 5.1 of Chapter 2.

When profiling the memory access patterns of these workloads, we use the Pin program analysis tool [53]. Pin is a dynamic instrumentation tool that enables injecting arbitrary code into arbitrary locations in a program; specifically, it enables injecting a profiling procedure at every memory access executed. For profiling, we use the basic cache module provided with Pin and modify the statistics gathered to fit our needs. Pin is an order of magnitude faster than Simics and therefore enables analyzing a larger instruction window than is possible in Simics. More details regarding this approach are given in Section 3 of Chapter 2.

For our work in the domain of many-core machines, we devise a high-level analytical model of an abstracted engine, targeted at modeling futuristic machines across the many-core range. We use the model to conduct a performance and power curve study, considering both synthetic workloads as well as real-world workloads from the PARSEC kit [12].

To validate our model and simulate real workloads, we use the in-house MTM$im simulator, which is built on top of Pin. Unlike the method used for memory access profiling, here MTM$im is responsible for the timing and scheduling of each instruction according to the modeled machine behavior; Pin is used only to determine what every instruction does and to provide the instruction stream. More details regarding the MTM$im simulator and the evaluation methodology used for this study are given in Section 5 of Chapter 3.
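The division of labor just described, where a functional front-end supplies the instruction stream and a separate timing model decides when each instruction completes, can be sketched as follows. This is a highly simplified, hypothetical timing loop written for illustration only; it is not the actual MTM$im code, and the latencies, hit rate, and round-robin scheduling are assumptions.

```python
import random

# Hypothetical, highly simplified trace-driven timing loop: a functional
# front-end produces instructions, and the timing model below decides when
# each one completes. All latencies and policies here are assumptions.

HIT_LATENCY, MISS_LATENCY = 10, 200   # cycles (assumed)
P_HIT, R_MEM = 0.8, 0.1               # assumed hit rate and memory-instruction ratio

def fetch_instruction():
    """Stand-in for the functional front-end (Pin, in the thesis's setup)."""
    return "mem" if random.random() < R_MEM else "alu"

def run(num_threads=4, num_instructions=10_000):
    ready_at = [0] * num_threads       # cycle at which each thread may issue again
    cycle, total = 0, 0
    while total < num_instructions:
        for t in range(num_threads):   # round-robin over ready threads
            if ready_at[t] > cycle:
                continue               # thread is stalled on an outstanding miss
            if fetch_instruction() == "mem":
                lat = HIT_LATENCY if random.random() < P_HIT else MISS_LATENCY
                ready_at[t] = cycle + lat
            total += 1
        cycle += 1
    print(f"{total} instructions in {cycle} cycles "
          f"(IPC = {total / cycle:.2f} across {num_threads} threads)")

if __name__ == "__main__":
    random.seed(0)
    run()
```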


Chapter 2. The Nahalal Cache Architecture

In this chapter we introduce Nahalal, a non-uniform cache (NUCA) architecture for CMP that partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. Via characterization of memory access patterns in typical parallel programs, we show that such workloads usually exhibit a shared hot lines effect, where a small subset of the lines in the working set is shared by many cores and is accessed numerous times during the application's lifetime. Thus, such workloads fit the Nahalal architecture particularly well. We demonstrate the potential of the Nahalal architecture via detailed simulations in Simics, where we show that Nahalal decreases cache access latency by up to 41% compared to traditional CMP designs, yielding performance gains of up to 12.65% in run time.

The related research is described in [30] [31].

1. Introduction

Cache data access is often a principal bottleneck in multi-core machines, where multiple threads compete for limited on-die memory resources and for access to them. Two major factors impact the latency and energy consumption of on-chip memory access: wire delays and contention over the shared resource. Global wire delays are becoming a dominant factor in VLSI design [2] [20] [36] [38], and hence on-chip cache access time increasingly depends on the distance between the processor and the location of the data on-chip. Concurrent access by multiple processors to a shared cache further increases the access time, as additional delays are incurred for resolving contention on the cache. A communication fabric, such as a Network-on-Chip [23] [29] [60], interconnects the processors and the on-chip cache.


Assuming the worst-case distance and latency for every memory access is wasteful. Hence, CMPs are moving away from a monolithic cache structure and are shifting towards a Non Uniform Cache Architecture (NUCA) [9] [42], where the cache is divided into multiple banks, and accesses to closer banks result in shorter access times. In NUCA, performance depends on the average (rather than worst-case) latency, and thus vicinity of reference becomes of critical importance: data should reside close to the cores that access it. The division of the cache into multiple banks also allows multiple processors to access different banks simultaneously, thus reducing contention.

Last-Level Cache (LLC) designs are typically based either exclusively on private caches or exclusively on shared caches. A private cache is associated with a single core, while a shared cache is shared among multiple cores. There are tradeoffs between these two choices. An important advantage of private caches over shared ones is the proximity of data to the processor that uses it, which yields fast access times and low energy consumption. On the other hand, private caches entail a static partitioning of the total capacity among the cores, which may lead to inefficient use of the cache capacity when the working sets of the different cores vary in size. Moreover, shared data, which is accessed by more than one core, needs to reside in multiple copies in private caches, further reducing the effective capacity. In order to manage such replicated data, a cache coherence protocol needs to be implemented, which complicates cache management and burdens the system with coherence transactions.

In contrast to private caches, shared caches store only a single copy of each data line, and thus eliminate the need for maintaining coherence among different copies of the same data. Storing single copies also increases the overall effective cache capacity. Since the gap between external memory latencies and on-chip access times continues to grow [75] [77], such higher cache utilization becomes of particular importance in future architectures. Despite all of these advantages, shared caches have one critical shortcoming that renders them inefficient, namely, costly access to shared data [9] [42]. In shared-cache NUCA solutions, shared cache lines inevitably reside far from some of their client processors, resulting in slow and expensive access. (See further details in Section 5.)


In order to understand the gravity of this problem, we conducted a study of memory access patterns in a broad range of multi-threaded applications (see Section 3). Our findings show that in many multi-threaded applications, a substantial fraction of the memory accesses involve such shared cache lines [8] [9]. Furthermore, in commercial and transactional memory workloads, a significant fraction of the memory accesses involve modified-shared data, which cannot be replicated without a performance penalty for ensuring cache coherence. These findings illustrate the importance of designing adequate solutions for shared data when designing CMP architectures. This is where current state-of-the-art solutions (based on either shared or private last-level caches) fall short. Our findings further show that although shared lines account for a substantial fraction of the total accesses to memory, these lines comprise only a small fraction of the total working set. Moreover, most of the shared data is shared by all processors. We dub this phenomenon the shared hot lines effect.

In Section 4, we leverage the above observations in the design of Nahalal, a novel CMP cache architecture that partitions the cache according to the programs' data sharing, and can thus offer vicinity of reference to both shared and private data. Nahalal treats shared data as a first-class citizen, and leverages the best of both the shared and private cache approaches. Like shared caches, it allows for flexible allocation of cache lines, accounting for differences in working set sizes; it further supports storing a single copy of a shared data line, eliminating storage waste and the need for cache coherence transactions. Like private cache architectures, Nahalal preserves the vicinity of reference for processors to both shared and private data.

Nahalal's topology was inspired by the layout of the cooperative village Nahalal, shown in Figure 2-1(a), which, in turn, is based on an urban design idea from the 19th century [41] [47]. In the village Nahalal, public buildings (school, administrative offices, warehouses, etc.) are located in an inner core circle of the village, enclosed by a circle of homesteads, and are hence easily accessible to all. Private tracts of land are arranged in outer circles, each in proximity to its owner's house.

We project the same conceptual layout to CMP, as schematically illustrated in Figure 2-1(b). (A detailed layout is given in Section 4.) Generally speaking, in Nahalal, a small fraction of the cache capacity is located in the center of the chip, enclosed by all processors, while the rest of the memory is placed on the outer ring. The inner memory is populated by the hottest shared data, and allows for fast access by all processors. The outer rings create a "private yard" for each processor in the periphery of the chip, improving vicinity of reference and reducing contention. Thus, the Nahalal design virtually divides the on-chip cache into two separate caches for "shared" and "non-shared" data.

Figure 2-1. Nahalal design: (a) aerial view of the Nahalal village [43]; (b) schema of the Nahalal CMP design. The innermost memory circle is designated for shared data.

To demonstrate the Nahalal concept, in Section 4 we implement two design examples in the context of an 8-way CMP with a NUCA-based shared cache. We consider a design where the cache is partitioned into a few large memory banks (one per core), and also a highly banked architecture that pushes the envelope of the NUCA idea. In Section 5, we show that these implementations of Nahalal significantly improve L2 cache access times compared to previously suggested designs for 8-way CMPs with NUCA. Such improvements are exhibited over a range of commercial and scientific benchmarks, as well as a typical transactional memory benchmark. Nahalal improves cache access times by up to 41% compared to traditional CMP designs, yielding performance gains of up to 12.65% in run time.


Beyond this particular example, the Nahalal concept of placing public data where it is easily accessible by all sharers may be more broadly applicable, and is expected to benefit performance, reduce power, and improve available bandwidth in various settings. Furthermore, we believe that the drive towards improved platform performance/power will lead to asymmetric architectures, which can use the most appropriate execution (or storage) element for each task [37] [58]. We see Nahalal's approach as part of the overall trend towards asymmetry at the platform level (e.g., asymmetric memory). Some of these future research directions are outlined in Chapter 4.

In summary, this work challenges the ways in which caches are organized. Traditionally, uniprocessor microarchitectures partitioned the cache based on content (e.g., instruction cache versus data cache in the Harvard architecture) and hierarchically (e.g., cache levels: L1, L2, etc.). We argue that with CMP architectures, additional cache dimensions will prove valuable, for example, based on data sharing, data coherency, and other CMP characteristics. This work develops the CMP data sharing paradigm.

2. Related Work

Two principal layout alternatives were proposed in recent studies of CMP cache organization: a number of proposals locate a multi-banked L2 cache at the center of the chip, surrounded by all processors [8] [9] [16] [17] [42]; several others consider a tile-based architecture [15] [46] [57]. Both of these alternatives are symmetric, that is, all cache banks have the same function. In contrast, Nahalal's cache structure is asymmetric, with the bank (or banks) in the center having a different function than the other banks.

Previously proposed designs were generally based either strictly on shared caches [9] [17] [42] or strictly on private caches [8] [16]. Since each type of cache has inherent drawbacks (as explained above), these works employ various mechanisms to mitigate these limitations. In contrast, Nahalal dedicates part of the cache to shared data, and the rest to private data. Jin and Cho [46] suggested a hybrid solution based on a tile architecture, where the cache in each tile can be configured as either shared or private by the operating system. Their approach differs from ours in that the designation of caches as public or private occurs at run-time and relies on software support; therefore, their physical (hardware) layout cannot be optimized for the intended usage as in Nahalal.

The term Non Uniform Cache Architecture (NUCA) was coined by Kim et al. [48] in the context of uniprocessor systems. They presented a uniprocessor design where the cache is divided into multiple banks, and accesses to closer banks result in shorter access times. Kim et al. considered both a Static NUCA (SNUCA) approach, where line placement is static and determined by address, and a Dynamic NUCA (DNUCA) approach. DNUCA applies dynamic line migration: every access to a cache line moves the line one step closer to the processor, thus gradually reducing distances and access times to popular data.
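A minimal sketch of this migration rule is shown below. The bank chain, the insertion point, and the promote-by-one-step policy are written directly from the description above; they are not the exact mechanism of [48].

```python
# Minimal sketch of DNUCA-style gradual migration: each access moves the line
# one bank closer to the requesting core. The bank chain and starting position
# are illustrative assumptions, not the exact mechanism of the cited designs.

class DNucaSet:
    def __init__(self, num_banks):
        # banks[0] is closest to the core, banks[-1] is farthest
        self.banks = [set() for _ in range(num_banks)]

    def insert(self, line):
        self.banks[-1].add(line)              # new lines start in the farthest bank

    def access(self, line):
        """Return the index of the bank that serviced the hit (None on a miss),
        then migrate the line one bank closer to the core."""
        for i, bank in enumerate(self.banks):
            if line in bank:
                if i > 0:
                    bank.remove(line)
                    self.banks[i - 1].add(line)
                return i
        return None

cache = DNucaSet(num_banks=8)
cache.insert("hot_line")
for n in range(10):
    print(f"access {n}: serviced from bank {cache.access('hot_line')}")
```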

Previous works on CMP cache design have recognized the need for shared L2 caches that follow NUCA, and suggest cache designs in which access times vary according to the distance between the data and the client processor [8] [9] [17] [42] [79]. Of those works, Beckmann and Wood [9] were the first to apply NUCA to CMP systems, devising an eight-node CMP structure where the L2 banks are located in the center of the chip and are shared by eight symmetric processors that surround them. We later use their design as a comparison reference in the evaluation of our Nahalal architecture. (See Section 5.)

Beckmann and Wood [9] and Huh et al. [42] have studied CMP-DNUCA, a mapping of the DNUCA approach to CMP, where data lines dynamically migrate towards the specific cores that access them. Both studies concluded that access to shared data hinders the effectiveness of DNUCA in CMP, since such data ends up in the middle of the chip, far from all processors. The Nahalal concept of bringing shared data close to all processors solves this Achilles' heel of DNUCA, and thus provides a platform where DNUCA can realize its potential in CMP.

Several works have used line replication to ease the shared data problem [8] [17] [79]. Such replication, however, reduces the effective cache capacity, further increasing the on-chip capacity pressure. Moreover, line replication is only cost-effective when the shared lines are read-only, since writing entails invalidation of all copies, which may impact performance. Nahalal reduces the need for replication by allowing a single copy to reside close to all the processors that share it.


Beckmann et al. [8] [9] have studied memory access patterns in CMP, identifying the imbalance between the number of accesses to shared blocks and the actual number of shared blocks in the working set, and pointing out the importance of shared blocks to overall memory performance. Our memory access characterization (see Section 3) continues this line of work, identifying the same trends for additional workloads and making additional observations.

The work of Liu et al. [52] is closely related to our research. Liu et al. were the first to suggest adding a central cache cell to a highly banked, DNUCA-based CMP in order to improve access times to shared data. Our work takes a more general approach, emphasizing the importance of architecturally differentiating cached data according to its usage. We study implementations of the Nahalal concept under two different design choices (heavily banked and one bank per core), and elaborate on aspects of the implementation such as several alternatives for searching cache lines and the ability to predict line locations. Our evaluation examines the effect of these issues on performance and power, and studies a wider range of applications, including transactional memory.

3. Memory Accesses Characterization

In this section we study the memory access and sharing patterns occurring in various standard multi-threaded workloads. We consider three types of benchmarks: scientific programs from the SPEComp [3] and Splash-2 [76] kits, commercial benchmarks (the apache web server [6], the zeus web server [78], and SPECjbb [73]), and software transactional memory benchmarks (from the RSTM kit [55]). More details about the workloads are given in Section 5.1.2. We use the Pin program analysis tool [53] to profile accesses to each 64-byte cache line (either L1 or L2) throughout the program execution, emulating a scenario where the program runs on a CMP with 8 nodes. (A similar approach for studying L2 accesses was used in [8].) We consider a line to be shared if multiple cores access it within a window of 10 million instructions. The profiling results, summarized in Table 2-1, lead to several observations.
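The classification just described can be expressed as a small post-processing pass over the access trace. The sketch below is an illustrative reconstruction of that pass, not the actual Pin tool used in the thesis: the trace format and helper names are assumptions, while the 64-byte line size and the 10-million-instruction sharing window are taken from the text.

```python
from collections import defaultdict

# Illustrative reconstruction of the trace post-processing described above
# (not the actual Pin tool): classify 64-byte lines as shared / modified-shared
# when more than one core touches them within a 10M-instruction window.

LINE_BYTES = 64
WINDOW = 10_000_000          # instructions per sharing window

def characterize(trace):
    """`trace` yields (instruction_count, core_id, address, is_write) tuples."""
    accesses = defaultdict(int)          # line -> total accesses
    shared, modified_shared = set(), set()
    window_cores = defaultdict(set)      # line -> cores seen in current window
    window_writes = defaultdict(bool)    # line -> written in current window
    window_end = WINDOW

    def close_window():
        for line, cores in window_cores.items():
            if len(cores) > 1:
                shared.add(line)
                if window_writes[line]:
                    modified_shared.add(line)
        window_cores.clear()
        window_writes.clear()

    for icount, core, addr, is_write in trace:
        if icount >= window_end:
            close_window()
            window_end += WINDOW
        line = addr // LINE_BYTES
        accesses[line] += 1
        window_cores[line].add(core)
        window_writes[line] |= is_write
    close_window()

    total = sum(accesses.values())
    shared_acc = sum(accesses[l] for l in shared)
    return {
        "shared lines %": 100.0 * len(shared) / max(len(accesses), 1),
        "shared accesses %": 100.0 * shared_acc / max(total, 1),
        "modified-shared lines %": 100.0 * len(modified_shared) / max(len(accesses), 1),
    }
```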


Table 2-1. Cache line sharing characteristics

                             Shared lines                Modified shared lines
Suite        Benchmark       % of accesses  % of lines   % of accesses  % of lines
SPEComp      equake              32.05        0.73           2.78         0.40
             fma3d                8.93        0.16           0.37         0.14
Splash-2     barnes              15.36        7.07           3.14         0.61
             water               24.90       11.96          17.55        10.85
Commercial   apache              58.25       34.33          47.91        25.26
             zeus                56.85       37.76          41.64        28.16
             specjbb             44.24       13.78          15.39         0.89
RSTM         RBTree              88.71       53.85          83.46        50.11
             HashTable           86.49       58.48          81.13        54.52

First (as can be seen in the third column of Table 2-1), in many workloads, accesses to

shared data comprise a substantial fraction of the total memory accesses (e.g., up to

58.25% in the apache workload and a whopping 88.71% in the RBTree benchmark).

Moreover (as can be seen in the fifth column of Table 2-1), in commercial workloads,

many of these accesses involve shared lines that are modified by at least one of the

sharers (e.g., in apache, 82% of the memory accesses to shared data are to modified

shared lines). This phenomenon is also true for the transactional memory benchmark, in

which more than 90% of the accesses to shared data are to modified shared lines.

Second (as can be seen when comparing the values in the third and forth columns in

Table 2-1), there is a clear discrepancy between the number of accesses to shared lines

and the number of lines shared in the working set: a small number of cache lines, shared

by many processors, accounts for a significant fraction of the total accesses to memory.

We dub this phenomenon the shared hot lines effect. Notice that this phenomenon also holds when considering only modified shared lines (see the last two columns in Table 2-1).

Furthermore, we observe that a small number of shared lines, some very hot lines, are more popular than others, accounting for most of the accesses to shared data. This phenomenon is shown in Table 2-2. Each column shows, for a given capacity, what percentage of the accesses to shared data fall within the most popular shared working set of that size. As can be seen


from the table, a small fraction of the lines is responsible for the majority of the accesses.

For example, in equake, the most popular 1MB of shared data accounts for 97.59% of the

accesses to shared data.

Table 2-2. Access distribution of shared cache lines

Sample       Benchmark    Percentage out of all accesses to shared data
                          0.5MB      1MB        1.5MB      2MB
SPEComp      equake       96.87      97.59      98.23      98.82
             fma3d        99.89      99.92      99.95      99.97
Splash2      barnes       93.44      96.67      98.83      99.50
             water        100        100        100        100
Commercial   apache       80.37      89.79      94.08      96.67
             zeus         84.38      90.75      93.82      95.99
             specjbb      99.98      100        100        100
RSTM         RBTree       98.21      98.49      98.69      98.87
             HashTable    97.95      98.27      98.50      98.69

We have also found that typically the same data is shared throughout the program's

lifetime; and that shared data is typically shared by many processors. Together, our

observations indicate that reducing the access latency to a modest number of shared data

lines can have a significant impact on CMP performance; and that a small subset of the

memory capacity suffices for holding the shared lines to which the majority of accesses

are made.
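
To make the sharing classification above concrete, the following is a minimal sketch of an offline trace-analysis pass in the spirit of our Pin-based profiling. It is not the actual tool; the trace record format, the helper names, and the shared/modified-shared bookkeeping are simplified assumptions.

    from collections import defaultdict

    LINE_SIZE = 64          # bytes per cache line
    WINDOW = 10_000_000     # sharing window, in instructions

    def sharing_profile(trace):
        # trace: iterable of (instr_count, core_id, address, is_write) records,
        # e.g. produced by a Pin-style instrumentation pass (assumed format).
        last_touch = {}                # line -> (core_id, instr_count) of the previous access
        accesses = defaultdict(int)    # line -> total number of accesses
        shared, written = set(), set()

        for instr, core, addr, is_write in trace:
            line = addr // LINE_SIZE
            accesses[line] += 1
            if is_write:
                written.add(line)
            prev = last_touch.get(line)
            if prev and prev[0] != core and instr - prev[1] <= WINDOW:
                shared.add(line)       # two different cores touched the line within the window
            last_touch[line] = (core, instr)

        # Approximation: a "modified shared" line is one that is shared and written at least once.
        modified_shared = shared & written
        total_lines = max(len(accesses), 1)
        total_accesses = max(sum(accesses.values()), 1)
        pct = lambda count, denom: 100.0 * count / denom
        return {
            "shared lines [% of lines]":            pct(len(shared), total_lines),
            "shared accesses [% of accesses]":      pct(sum(accesses[l] for l in shared), total_accesses),
            "mod. shared lines [% of lines]":       pct(len(modified_shared), total_lines),
            "mod. shared accesses [% of accesses]": pct(sum(accesses[l] for l in modified_shared), total_accesses),
        }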

4. Cache Organization and Management

We now discuss a possible realization of the Nahalal concept in the context of an 8-

way CMP. We first (in Section 4.1) present Nahalal's cache organization, and then (in Section 4.2) discuss cache management issues, concentrating on line placement and migration.

4.1. The NAHALAL Layout

The main concept in Nahalal is placing shared data in a relatively small area in the

middle of the chip, surrounded by the cores, and locating private data in the periphery.


This solution is feasible thanks to the shared hot lines phenomenon described in Section

3, which suggests that a relatively small structure is sufficient for serving the majority of

shared data accesses. We implement this concept in two typical 8-way CMP architectures

– one with 8 memory banks (one per core), as suggested, e.g., in [8] [16], and one 64-

bank architecture, pushing the envelope of the NUCA idea, as suggested in [9]. In both

cases, we compare Nahalal to a traditional Cache In the Middle (CIM) layout.

Figure 2-2(a) depicts a typical Cache In the Middle (CIM) organization for an 8-way

CMP with one L2 bank in the proximity of each core [8]. Figure 2-2(b) portrays our

alternative layout based on the Nahalal concept. In both designs, each core has a private

L1 cache, and the L2 cache is banked. The L2 cache capacity is partitioned among the

cores so that each core has one cache bank in its proximity (depicted in light grey). In

Nahalal, some of the total L2 cache capacity is designated for shared data, and is located

in the center (depicted in dark grey). Figure 2-3(a) shows a suggested layout for the more

aggressive DNUCA with 64 banks [9], and Figure 2-3(b) shows a Nahalal layout for the

same number of banks.

(a) Cache In the Middle (CIM) layout. (b) Nahalal layout.

Figure 2-2. Two cache organizations for an 8-way CMP with one L2 memory bank per core.


(a) Cache In the Middle (CIM) layout. (b) Nahalal layout.

Figure 2-3. Two cache organizations for an 8-way CMP with a heavily banked (64 banks)

L2 cache.

4.2. Nahalal Cache Management

A broad range of potential cache management strategies can be implemented given the

layout depicted in Figure 2-2 and Figure 2-3. In particular, the management of CMP-DNUCA-based caches has been studied in several previous works [9] [42] [69]. These

mechanisms are, by-and-large, agnostic to the cache layout topology, and can be used in

Nahalal. In this section, we present one possible cache management policy, based on

shared L2s, for each of the four layouts (CIM and Nahalal, 8 or 64 banks). We

implemented these policies in Simics [54] and experimented with them as reported in the

next section. The CIM management policies follow the ones proposed in [9]. Nahalal's

management closely follows that of CIM, with the necessary adaptations. We now briefly

discuss the relevant details for the sake of completeness, while elaborating on issues that

are unique to the Nahalal design.

We use the MESI coherence protocol [7] and enforce inclusion between the shared L2

and all L1 caches (as in [9] [42]). Each L2 line's tag includes a sharing status vector

comprised of a bitmask indicating which L1 has a valid copy and a dirty bit that indicates

if one of the higher level caches has a modified version. When a processor modifies an

L1 copy, an update message is sent to the L2 cache, which in turn sends invalidation


messages to all the other L1 caches that hold the line (a similar approach was used in

[9] [42]).
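
For illustration, the per-line directory state described above can be sketched as follows. This is a simplified model; the field and method names are ours, not the simulator's.

    from dataclasses import dataclass

    @dataclass
    class L2LineTag:
        # Directory-style bookkeeping kept alongside each L2 line (names are illustrative).
        sharers: int = 0      # bitmask: bit i is set if core i's L1 holds a valid copy
        dirty: bool = False   # set if some L1 holds a modified version

        def is_shared(self) -> bool:
            # "Shared" here means more than one bit set in the sharing status vector.
            return bin(self.sharers).count("1") > 1

        def on_l1_write(self, writer: int, num_cores: int = 8):
            # An L1 write triggers invalidations to every other L1 that holds the line.
            victims = [c for c in range(num_cores)
                       if (self.sharers >> c) & 1 and c != writer]
            self.sharers = 1 << writer
            self.dirty = True
            return victims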

In all layouts, we focus on a shared cache paradigm, where all processors can access

all banks of the L2 cache. Each address can be located in multiple banks (although only

one copy of a line may be present in the cache at any given time). Thus, cache

management needs to decide where to place the line when it is first fetched, and

subsequently, if and when to migrate a line from its current bank, and to where (to

another bank or to be evicted from the cache). Placement and migration of cache lines are

discussed in Section 4.2.1 (for the case of one bank per processor) and Section 4.2.2 (for

the heavily-banked case). In addition, a search mechanism is required in order to discover

cache lines in the banks where they reside. Search is discussed in Section 4.2.3.

4.2.1. Placement and Migration – One Bank per Processor

In both implementations of Figure 2-2, (CIM and Nahalal), when a line is first fetched,

it is placed in the bank adjacent to the processor that made the request. In the

implementation of the 8-bank CIM, the line remains in its initial location as long as it is

not evicted from the cache. In contrast, Nahalal uses migration in order to divert shared

hot lines to the cache bank at the center of the chip. To prevent pollution of the center

bank with unpopular lines, a line is migrated to the center bank only after N accesses (for

some threshold N) from different cores. In our implementation, the threshold is set to 8

accesses, and the policy is implemented by adding a 3-bit counter to each cache line.

The central bank is managed using an LRU eviction policy; that is, when space needs

to be made for a new line, the least recently used line (LRU) among all cache lines in the

center is evicted. However, the victim line is not evicted from the entire L2 cache.

Instead, the locations of the two lines (the one moving to the center and the evicted one)

are swapped.

Since saving usage time statistics for all the lines in the center may be costly in terms

of hardware complexity, we organize the central area as an 8-way cache structure,

tracking LRU statistics over each set. Such a structure is more feasible in terms of

hardware complexity, while still keeping the most used shared lines in the center.
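
The following sketch summarizes this placement and migration policy for the 8-bank Nahalal layout. It is a behavioural approximation, not the Simics implementation; in particular, we read "N accesses from different cores" as accesses arriving from cores other than the line's home core, and the structure names are illustrative.

    from collections import OrderedDict

    MIGRATION_THRESHOLD = 8          # accesses before a line moves to the center (3-bit counter)

    class CenterBank:
        # The shared central bank, organized as an 8-way set-associative structure with
        # per-set LRU ordering (a behavioural sketch, not a hardware description).
        def __init__(self, num_sets, ways=8):
            self.sets = [OrderedDict() for _ in range(num_sets)]
            self.ways = ways

        def insert_with_swap(self, line_addr, incoming_home_bank):
            # Bring line_addr into the center; if the set is full, the LRU victim is not
            # dropped from L2 but swapped into the slot vacated in the incoming line's home bank.
            s = self.sets[line_addr % len(self.sets)]
            victim = None
            if len(s) >= self.ways:
                victim_addr, _ = s.popitem(last=False)          # least recently used entry
                victim = (victim_addr, incoming_home_bank)      # victim now lives in that home bank
            s[line_addr] = True
            s.move_to_end(line_addr)
            return victim

    def on_l2_access(line, requesting_core, center):
        # line: dict with 'addr', 'bank' (home core id or 'center'), and 'counter'.
        if line['bank'] == 'center':
            return None
        if requesting_core != line['bank']:
            # Accesses from a core other than the home core count toward the threshold
            # (our reading of "accesses from different cores").
            line['counter'] = min(line['counter'] + 1, MIGRATION_THRESHOLD)
            if line['counter'] >= MIGRATION_THRESHOLD:
                swapped = center.insert_with_swap(line['addr'], line['bank'])
                line['bank'] = 'center'
                return swapped
        return None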


4.2.2. Placement and Migration – Heavily Banked DNUCA

In CMPs with many cache banks, data typically migrates among banks at run-time

[9] [42]. Banks are partitioned into groups called banksets, so that the members of each

bankset are distributed in all areas of the chip. Thus, for a given core, each bankset

includes banks residing at various different distances from the core. Every data line is

mapped to exactly one bankset according to its address, and may reside in any bank

pertaining to the designated bankset. When a processor accesses a cache line that already

resides in one of the banks, the line can migrate to another bank that belongs to the same

bankset and is located closer to the processor. In our implementations, each bankset is

comprised of 16 banks, and thus each line has 16 possible locations, as in a 16-way

cache. Similar settings were used in previous studies of heavily banked DNUCA [9][41].

In the CIM layout of Figure 2-3(a), we implemented the approach of Beckmann and

Wood [9], whereby migration is gradual. When a core accesses a data line that resides in

a remote bank, that data line is not immediately transferred to the vicinity of the

requesting processor. Instead, the data line makes a single step towards the processor—

the line is moved to the bank closest to its current location among the banks in the same

bankset that reside closer to the processor. The line is swapped with one of the lines in

the chosen bank, creating a process that resembles a bubble sort.

We follow a similar strategy in the Nahalal design of Figure 2-3(b), except for the

special treatment of shared data. To identify shared data, we use the sharing status vector

of the cache coherence mechanism. (Note that although we do not store multiple copies

of the same data line in the L2 cache, multiple copies may still reside in the processors'

L1 caches. Hence, a cache coherence mechanism is employed for L1 cache

management.) When a line is accessed, we first check the sharing status vector; if more

than one bit is set, the line is deemed shared, and is migrated to one of the banks in the

center of the chip, (unless it is already there). If the line is not shared, it is migrated

towards the requesting processor’s private area.
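
The migration decision for the heavily-banked Nahalal layout can be summarized as follows. This is a sketch under assumed helper objects for banks, distances, and the sharing bitmask; it is not the simulator's code.

    def popcount(x: int) -> int:
        return bin(x).count("1")

    def choose_destination(line, requesting_core, bankset, dist):
        # line.sharers: sharing bitmask from the coherence directory; line.bank: current bank.
        # bankset: the 16 candidate banks for this address; dist(a, b): hop distance.
        # All of these are assumed helper objects.
        if popcount(line.sharers) > 1:
            # Shared line: keep it in (or send it to) a central bank of its bankset.
            if line.bank.is_central:
                return line.bank
            central = [b for b in bankset if b.is_central]
            return min(central, key=lambda b: dist(line.bank, b))
        # Private line: gradual migration, a single step towards the requesting core, chosen
        # among the banks of the same bankset that are closer to that core than the current bank.
        closer = [b for b in bankset
                  if dist(b, requesting_core) < dist(line.bank, requesting_core)]
        if not closer:
            return line.bank
        return min(closer, key=lambda b: dist(line.bank, b))   # one hop of the "bubble sort"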


4.2.3. Search

All of our implementations are based on the DNUCA paradigm, where a given cache

line may reside in multiple banks, and its location in the cache is determined at run-time.

This raises the question of how to search whether or not a requested line is in the cache,

and if it is, where it is located. We now describe several alternatives for doing so.

The fastest way to search is to send queries to all the relevant banks in parallel (to all

banks in the 8-bank case, and to the ones pertaining to the appropriate bankset in the 64-

bank case), and have a bank that contains the requested line respond. If no bank responds

within an appropriate timeout, it can be concluded that the line is not in the cache. While

this parallel search allows lines to be located very quickly, it is also highly inefficient in

terms of power dissipation. Furthermore, it burdens the on-chip interconnect with many

requests.

In order to conserve energy and reduce the interconnect's load, it is preferable to

stagger the search, and thus reduce the number of queries sent in case the line is found

early. The extreme version of this approach is a sequential search, in which the requesting

processor checks the relevant banks one at a time. The processor checks banks in

increasing order of their distances from it, starting from the closest bank. The search

continues until either the line is found or all relevant banks have been searched.

Sequential search was implemented for CIM layouts in [9] [42].

In Nahalal, there are two relevant banks at approximately the same distance from the

processor— one local, i.e., in the processor’s “private back yard”, and one in the center.

This raises the question where to begin the search. In some benchmarks more accesses

are made to private data, while in others (most notably commercial ones), accesses to

shared data exceed those to private data. (See Section 3.) In order to reduce the load on

the shared central structure, Nahalal’s sequential search first checks the requesting

processor's closest relevant local bank. If the line is not found there, the processor checks

the relevant bank (or banks) in the center of the chip, and then checks all other relevant

banks as needed, in order of their distance from it.

In both the CIM and Nahalal layouts, sequential search may take a long time to

complete, the worst case occurring when all banks are searched. However, we observe


that Nahalal has an advantage in average search time (as well as average power

dissipation) thanks to its more predictable placement of shared lines, as we now explain and as the simulations in the next section confirm. Consider, for the remainder of this

section, the 8-bank design (the case of 64 banks is similar; for clarity of the exposition,

we focus on one design). When a frequently used private cache line is accessed, either in

Nahalal or CIM, it is found by the first search query in the requesting processor’s local

bank. However, if a frequently used line is not found by the first query, the line is most

likely shared, and hence resides elsewhere in the cache. In this case, with the CIM

approach, it is equally likely for the line to reside in each of the other seven banks.

Therefore, it takes an average of 3.5 additional queries to locate the line. In contrast, in

Nahalal, the line is most likely located in the center, and will thus be found by one

additional query.

There is a tradeoff between the increased latency of the sequential search and the

higher power costs for the parallel search. This predictable placement of shared lines in

Nahalal allows us to balance these two considerations. We devised a hybrid search

approach, whereby two banks are first searched in parallel, namely the processor’s local

bank and the center bank. If neither query locates the requested line, the search continues

sequentially. The hybrid approach sends at most one query more than the sequential

search, and a superfluous query is sent only in case the line is in the local cache. In

contrast, parallel search may send up to 7 superfluous queries, and sends superfluous

queries in almost all cases (except when the entire cache needs to be searched). In terms

of performance, the hybrid approach can resolve most searches within one step, since

most accesses are made to lines that reside either in the local bank or in the center bank.

Though it provides a good tradeoff between power and performance, the hybrid

approach still suffers from two drawbacks. First, although its power dissipation is modest

compared to that of parallel search, due to the increasing importance of energy saving in

modern architectures, it is undesirable to expend even this modest cost. Second, the

hybrid approach queries the central bank for every access, which may create excessive

contention among the cores at the center. In order to mitigate both of these shortcomings,

we implemented a simple predictor at each core, which records the last 1K lines that

were fetched from the center. Nahalal’s sequential search with predictor proceeds as


follows. The processor first checks if the accessed line is stored at the predictor. If it is,

the line is first searched in the center bank. Otherwise, the line is first searched in the

local bank. In both cases, if the first query fails to locate the line, the search continues

sequentially. For a modest overhead of 0.56% of the total cache capacity (given the cache

sizes simulated in the next section), the sequential search with predictor achieves almost

the same performance as the hybrid approach, with the minimal energy cost of the

sequential approach.
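
As an illustration, the predictor-assisted sequential search can be sketched as follows. The predictor structure, the probe interface, and all names are illustrative assumptions, not the actual implementation.

    from collections import OrderedDict

    class CenterPredictor:
        # Per-core record of the last 1K line addresses that were served from the center bank.
        def __init__(self, capacity=1024):
            self.entries = OrderedDict()
            self.capacity = capacity

        def remember(self, line_addr):
            self.entries[line_addr] = True
            self.entries.move_to_end(line_addr)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)

        def predicts_center(self, line_addr):
            return line_addr in self.entries

    def sequential_search_with_predictor(line_addr, local_bank, center_bank,
                                         other_banks_by_distance, predictor, probe):
        # probe(bank, line_addr) -> True if the bank holds the line (assumed interface).
        # Returns (bank, number_of_queries), or (None, queries) if the line is not in L2.
        first, second = ((center_bank, local_bank) if predictor.predicts_center(line_addr)
                         else (local_bank, center_bank))
        order = [first, second] + list(other_banks_by_distance)
        for queries, bank in enumerate(order, start=1):
            if probe(bank, line_addr):
                if bank is center_bank:
                    predictor.remember(line_addr)
                return bank, queries
        return None, len(order)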

Finally, for CIM, such a simple predictor is not feasible, because there are more than

two plausible locations for each cache line. However, it is possible for each core to track

the exact locations of frequently used cache lines. Such an approach has been

implemented by Ricci et al. [69] in the context of highly banked DNUCA.

5. Performance Evaluation

In this Section we evaluate the performance of the Nahalal architecture using a

detailed system simulation in Simics [54]. We first describe, in Section 5.1, our

simulation methodology, evaluation environment and parameters, and the benchmarks

used. We then proceed to present our results.

Nahalal's principal benefit is in reducing the distances between processors and their

data. In order to isolate the impact of this phenomenon from secondary artifacts like

search time, we first run experiments using parallel search for all layouts (Sections 5.2

and 5.3). This scenario actually favors the CIM design, since the more energy-efficient

search approaches work faster with Nahalal than with CIM, as we show in Section 5.4,

where we study the effect of the different search mechanisms on both performance and

energy. In Section 5.2, we measure cache access delays and the distances between

processors and their data - the direct artifacts of the Nahalal layout. In Section 5.3, we

examine Nahalal's impact on overall performance, as well as performance trends under

increasing wire delays.


5.1. Methodology

5.1.1. Simulation Setup

Table 2-3. System Parameters

Parameter                              8-bank                        64-bank
L1, L2 line size                       64B, 64B                      64B, 64B
L1 cache size, ways, access time       32KB, 2-way, 3 cycles         32KB, 2-way, 3 cycles
Main memory access time                300 cycles                    300 cycles
L2 cache size                          16MB                          16MB
Bank size                              2MB (1.875MB in Nahalal)      256KB
L2 bank access time                    15 cycles                     6 cycles
Link delay between adjacent banks      5 cycles                      2 cycles

To demonstrate the potential performance gain of the Nahalal topology, we

implemented the four 8-processor CMP design examples of Figure 2-2 and Figure 2-3 in

the Simics [54] full-system simulator. Our simulation parameters closely follow those of

previous work [9], wherever applicable. The processors are implemented using the x86

in-order processor model. In all designs, each processor has a private 32KB L1 cache,

and the processors share a 16MB L2 cache.

In the CIM 8-bank layout of Figure 2-2(a), each core is adjacent to a 2MB cache bank;

in the corresponding Nahalal topology (Figure 2-2(b)), each processor has a 1.875MB

bank in its “private back yard”, and the central bank's capacity is 1MB. In the 64-bank

case (Figure 2-3), each bank's capacity is 256KB. Access times are shorter for small banks

(6 cycles) than for large ones (15 cycles). The processor and banks are interconnected via

a Network-On-Chip (NoC) [23][26] [60], and thus the time to access a line is composed

of the bank access time, and the time required to traverse all links across the path

between the processor and the bank hosting the line. (Each link's bandwidth is assumed to

be infinite, and no contention is modeled at the routers. Thus, the delay incurred by the

network is proportional to the number of hops in the network between the source and the

destination.) We assume that the larger banks of the 8-bank CMP (Figure 2-2) incur

larger distances between the banks themselves, and hence require more time to traverse


compared to the heavily-banked CMP design of Figure 2-3 (5 cycles and 2 cycles, respectively). All system parameters are summarized in Table 2-3.
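
Under this model, the uncontended latency of an L2 access is simply the bank access time plus one link delay per hop on the path; a small sketch with the Table 2-3 parameters:

    def l2_access_latency(hops, bank_access_cycles, link_delay_cycles):
        # Uncontended latency: bank access time plus one link delay per hop traversed.
        return bank_access_cycles + hops * link_delay_cycles

    # With the Table 2-3 parameters:
    print(l2_access_latency(0, 15, 5))   # 8-bank design, local bank: 15 cycles
    print(l2_access_latency(3, 15, 5))   # 8-bank design, a bank 3 hops away: 30 cycles
    print(l2_access_latency(3, 6, 2))    # 64-bank design, a bank 3 hops away: 12 cycles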

5.1.2. Benchmarks

In order to evaluate Nahalal over a broad range of applications, we study three

families of benchmarks. First, we run sample scientific benchmarks from the Splash-2

[76] and SPEComp [3] kits. Second, we experiment with three commercial workloads

(apache [6], zeus [78], and SPECjbb2000 [73]). For the web benchmarks, we use the

SURGE [6] toolkit to generate a representative workload of web requests from a 30,000-

file, 700MB repository with a zero backoff time. This toolkit generates a Zipf distribution

of file accesses, which was shown to be typical of real-world web workloads. Finally, we

study two representative software transactional memory benchmarks, HashTable and

RBTree, from the RSTM [55] kit; we give equal probability to insert, delete, and search

in each data structure.

All workloads are run within the simulated system on top of Mandrake 10.1 Linux

with SMP kernel version 2.6, custom-compiled to interface with Simics. For all

workloads, we fast-forward through the serial part, and then perform measurements in the

parallel part of the code, which is the most interesting for our purposes. In all workloads

of the SPEComp and RSTM kits, the parallel part itself consists of loops that are iterated through hundreds or thousands of times (in the SPEComp benchmarks), or even millions of times (in the case of apache and zeus). All iterations have very similar characteristics. We

therefore run only some of the iterations (hundreds in apache and zeus).

5.2. Cache Delay and Relative Distance

Figure 2-4 presents average L2 cache access times for Nahalal and CIM in various

benchmarks. Nahalal achieves superior results in all benchmarks, regardless of the

number of banks. It reduces the average L2 cache access time compared to CIM by an

average (over all benchmarks) of 26.8% and 26.2% for the 8-bank and 64-bank cases,

respectively. The most significant improvement, of 41.1% for 8 banks (37.2% for 64

banks), is obtained for the apache benchmark.


Nahalal’s faster average access time stems from faster access to shared data, as shown

in Figure 2-5. In all layouts, most of the private data is located in the local banks of each

processor. But while Nahalal is able to serve most of the accesses to shared data from the

center of the chip, in CIM layouts, most accesses to shared data go to remote banks, thus

suffering long access times. Consequently, Nahalal benefits from short average access

time even for benchmarks with many accesses to shared data.

(a) 8-bank CMP

(b) 64-bank CMP

Figure 2-4. Average L2 cache access times (in clock cycles) for the 8-bank and 64-bank CMPs; labels indicate Nahalal's reduction in L2 hit time compared to CIM.


(a) 8-bank CMP

(b) 64-bank CMP

Figure 2-5. Breakdown of average distance to shared and private lines for CIM and

Nahalal; distance is measured in hops (the number of intermediate banks on the path to the data).

5.3. Overall Performance

Nahalal achieves an average speedup of 7% in total execution time compared to CIM.

The best speedup is obtained for commercial benchmarks - 9.32% and 12.65% for zeus

and apache respectively for the 8-bank case (13.98% and 10.31% for the 64-bank case).


On transactional memory benchmarks, the speedup is 6.42% and 7.42% (for the RBTree

and HashTable, respectively). Nahalal is most advantageous when there are many

accesses to shared data in L2. Thus, for benchmarks with a low L2 access rate (e.g.,

barnes), where L2 is not the bottleneck, or for benchmarks with almost no sharing, (e.g.,

fma3d), Nahalal’s improvement in overall performance is more modest. In the future, one

can expect CMP applications to have larger memory demands and exhibit more sharing,

while growing wire delays will increase the importance of locality of reference. Hence,

Nahalal’s benefit will become more significant.

Figure 2-6 projects the importance of Nahalal in future architectures, where wire

delays will become more dominant. The figure demonstrates the effect of increasing wire

delays on performance for the apache benchmark (in a 64-bank design). We consider a

reference system where each line's location is determined statically, according to its

address (this is called Static NUCA, or SNUCA). We depict the speedup in runtime of

both CIM and Nahalal over the reference system as the per-hop link delay increases.

Although both systems exhibit more speedup as the wire delay increases, the relative gain

of Nahalal grows more as technology scales. This is because distance-related delay

becomes dominant, and Nahalal is more effective in reducing the average access

distances. Overall, the Nahalal solution is more scalable as wire delays become dominant.

Figure 2-6. Runtime speedup over static line placement for both CIM and Nahalal for

increasing link delays (in clock cycles). The results are given for the apache benchmark, in

the 64-bank cache design. Nahalal's performance gain increases as the wire delay becomes

more dominant.


While the number of cores that can reside in physical proximity to a shared cache is

limited, we note that this limitation is even more severe in the CIM layout, where shared

data ends up in the middle, and its distance from all cores grows as the number of cores

increases. Moreover, given that Nahalal features a single designated cache for hot shared

data, one can effectively mitigate this limitation by investing more resources in the

designated shared cache, e.g., by laying out direct fast wires from all cores to the

designated cache, by using stronger drivers, and by implementing multiple ports.

5.4. The Effect of Search

Figure 2-7. Average search time normalized to CIM_par search.

We now turn to consider more energy-efficient search mechanisms. Our baseline for

comparison is parallel search in CIM layout (denoted CIM_par). Figure 2-7 compares the

average search times of parallel and sequential search in Nahalal (Nahalal_par and

Nahalal_seq, respectively), and sequential search in CIM (CIM_seq), normalized to the

search time of CIM_par. Nahalal_par is faster than CIM_par because queries to closer banks complete sooner, and as we saw above, Nahalal's average distance to data is shorter. In both

cases, sequential search is more energy efficient than parallel search, but also slower—

by 44.5% and 37% for CIM and Nahalal, respectively. The penalty for sequential search

is smaller in Nahalal thanks to the more predictable location of frequently accessed


shared data, as explained in Section 4.2.3. Thanks to the shorter distance to data, along with the smaller penalty for using sequential search, Nahalal_seq's average search time is

only between 1% (for equake) and 40% (for zeus) longer than that of CIM_par. In

contrast, CIM_seq can be up to 129% slower than CIM_par (e.g., barnes).

Next, we consider two search schemes that balance the tradeoff between power and

latency: (1) Nahalal_hybrid, which first searches the local and central banks in parallel,

and then proceeds with sequential search; and (2) Nahalal_seq_predictor, sequential

search augmented with a predictor in order to decide which of the two nearest banks to

search first (local or central). The normalized average search times of these two for

various benchmarks appear in Figure 2-8. We observe that Nahalal_hybrid’s search time

is only 13% slower than that of Nahalal_par, on average. Finally, Nahalal_seq_predictor,

which consumes even less energy than Nahalal_seq (because it searches in the correct

place in the first step more often), is nearly as fast as Nahalal_hybrid (only 10% slower

on average). Recall that our predictor merely saves the last 1K lines that were served

from the center bank, and its overhead is thus negligible.

Figure 2-8. Average search time normalized to CIM_par of Nahalal’s parallel search,

hybrid search, and sequential search with a predictor.


Finally, we study the efficiency of each scheme using a combined power and

performance metric. More specifically, we multiply the number of search queries sent in

each fetch by that fetch’s search time. The product, denoted delay ⋅ search transactions, is

low when the search is most efficient. The results for Nahalal’s four search schemes are

presented in Figure 2-9; the results are again normalized to CIM_par. As expected,

sequential search (CIM_seq and Nahalal_seq) performs well when most of the data is

private, since it hits the right bank in the first step (e.g., in equake and fma3d). When the

percentage of accesses to shared data increases, CIM_seq degrades drastically, since it

requires an average of 4.5 searches to find shared data. Nahalal_seq also degrades, but to

a much smaller extent, since shared data is typically found within 2 steps. It remains

superior to CIM_par in all benchmarks. Nahalal_hybrid better balances the energy versus

latency tradeoff when shared data is involved. Finally, Nahalal_seq_predictor exhibits

the best results over all benchmarks, outperforming the hybrid search by 25.7% on

average. We conclude that the (modest) 1K overhead needed for Nahalal_seq_predictor is

well worth its benefits; this predictor-based sequential search is the most appropriate

scheme for future CMPs, which need to take into consideration both performance and

power costs.

Figure 2-9. (Delay)·(Number-of-search-transactions) product for Nahalal’s sequential,

hybrid, and sequential-with-predictor search schemes, relative to CIM_par.


6. Summary

The on-chip memory system is becoming a performance bottleneck in chip

multiprocessors. Therefore, cache architectures should be specifically adapted and

optimized to the emerging multi-processing environment. In this work, we proposed

partitioning the shared on-chip cache according to the level of data sharing. This

approach is motivated by the observation that, in many multithreaded applications, a

small set of shared cache lines accounts for a significant portion of the memory accesses.

We have leveraged the shared hot-lines phenomenon to devise Nahalal - a new CMP

topology that locates shared data close to all sharers and still preserves vicinity of private

data for all processors. We have demonstrated the potential of the Nahalal concept via

two CMP design examples, a shared cache with one L2 bank per core, and a heavily-

banked shared cache leveraging NUCA. In both cases, Nahalal greatly improves average

cache access times, with savings up to 41.1%. We have considered several searching

schemes to locate a line within the cache and have shown that the Nahalal design allows

for more efficient search in terms of power and performance than traditional CMP

layouts. Moreover, we presented a predictor-based search scheme that achieves high

performance at a very low power cost.

While this chapter presents a specific optimization of CMP cache architectures,

namely optimizing shared data access, it fits within the broader picture of re-thinking

traditional design, for example, by breaking symmetry. While we focused on partitioning

according to data sharing, we argue that other dimensions will also prove valuable in

CMP systems. We discuss this at length in Chapter 4.


Chapter 3.

The Interplay of Caches and

Threads in Many-Core Machines

In this chapter we present our study on the interplay between caches, multi-threading,

and parallel workloads in the new generation of throughput-oriented many-core

machines. These high-performance engines now combine a large number of graphics-

oriented parallel cores, aggressive multi-threading capabilities, and a cache architecture. They target a growing body of highly-parallel workloads, no longer restricted to the graphics domain.

Our work provides a new model for capturing the behavior of such parallel workloads

on different many-core architectures. Specifically, we provide a simple analytical model,

which, for a given application, describes its performance and power as a function of the

number of threads it runs in parallel, on a range of architectures.

We use the analytical model (backed by simulations) to qualitatively study how

different properties of both the workload and the architecture affect performance and power.

We recognize distinctly different behavior patterns for different application families and

architectures, and identify a valley-like behavior where machines deliver inferior

performance. We study the shape of this “performance valley” and provide insights on

how it can be avoided.

The related research is described in [28] [32].

1. Introduction

Nowadays, throughput-oriented CMP machines– GPUs and similar accelerators– are

becoming increasingly popular. Such engines serve the mounting computation needs of

high-throughput and graphics-processing applications. Meanwhile, the body of

applications targeted at such throughput-oriented machines continues to grow: the


term GPGPU [27] reflects a broadening of the focus to include not only graphics, but also

a wide range of highly-parallel applications.

As memory access is a principal bottleneck in current-day computer architectures [75],

a key enabler for high performance in these throughput-oriented machines is masking the

memory overhead. Today’s high-performance engines employ two design principles to

overcome memory-related issues: The first is based on a cache architecture that takes advantage of locality of reference to memory. Intel's Larrabee [71] and Sun's Niagara

[28] are prominent examples of this approach. The second approach uses aggressive

multithreading so that whenever a thread is stalled, waiting for data, the system can

efficiently switch to execute another thread. This approach is heavily used in current

graphics processing engines such as Nvidia's GT200 [63] and AMD/ATI's Radeon R700

[4], which manage thousands of in-flight threads concurrently. Moreover, we now see the

emergence of systems, like Nvidia’s Fermi [64], that employ both approaches by

combining large caches with numerous in-flight threads.

Since such a combination of two very different approaches is used to overcome the

memory bottleneck, performance prediction becomes non-intuitive and challenging. The

extent to which an application will benefit from either approach depends on many

architecture and workload parameters. Moreover, the relative impact of caching

compared to multi-threading changes as the number of threads scales up. This complex

behavior, in turn, poses a challenge for architecture designers, who need to allocate the

limited on-die resources to cores, thread contexts, and caches. Finally, given a diversity

of already available high-performance architectures, there is the question of which is the

best fit for a given workload.

In this work we address these challenges by developing a simple, high-level, closed-

form model that captures both the architecture and the application characteristics (see

Section 3). The modeled machine uses a parameterized combination of both mechanisms

for memory latency masking, and can thus capture a range of machines, rendering the

comparison between them meaningful. The workload model, in turn, captures the salient

properties of the program, which allows one to predict which architecture is most

beneficial for it. All the parameters— capturing both architecture and workload— can be


used as ''knobs'' for studying a wide range of scenarios, in order to comprehend the

interplay among multiple parameters in a clean, qualitative way. The model thus serves

as a vehicle to derive intuitions.

In Section 4, we study how different properties of an application affect performance

and power. We identify three families of workloads with distinct behavior patterns:

While some workloads have a clear affinity towards either caching or multi-threading,

others can benefit from both. Moreover, some workloads exhibit an unintuitive "valley"

between the cache efficiency zone and the thread efficiency zone, where performance

takes a dip.

In Section 5 we back our analytical model by simulations. Our results indicate that the

simple, closed-form model of Section 3 can, in most cases, predict dynamic behavior, and

can thus be used to select the most efficient hardware structure for executing a given

program. Whereas Section 4 concentrates on synthetic workloads, Section 5 studies

workloads from the PARSEC benchmark suite [12], and shows that the three distinct

behaviors observed in Section 4 are indeed present in real workloads.

To summarize, our main contributions in this work are as follows:

• We present a simple closed-form model for systematically reasoning about

complex phenomena; the model captures the behavior of parallel workloads on

high performance many-core engines that employ any combination of caching

and aggressive multi-threading.

• We conduct a qualitative study of the inherent tradeoffs between the two

approaches for memory access masking, and their sensitivity to a range of

parameters. Our study yields non-intuitive observations regarding the impact

of architectural choices and workloads characteristics on performance and

power.

• We validate our model via simulation of real workloads.

Finally, we believe that our model can direct further research on ways to address the

memory wall problem in the new generation of many-core CMP. We elaborate on future

research direction in Chapter 4.


2. Related Work

Though there are many existing analytical models for processors’ performance, they

mostly concentrate on a single family of processors (either multi-core, multi-thread,

GPU, etc.), and thus on one of the two paradigms – either caching or multithreading. To

the best of our knowledge, our model is the first to specifically target the interplay of the

two paradigms and is the first to model both via a single, unified model that enables a

clean comparison across the design space of the new generation of high performance

engines.

Agrawal [1] studied the effect of the degree of multithreading on performance in the

early 90’s. In retrospect, Agrawal's analysis can be seen as applicable to our cache

efficiency zone - it observes a similar trend to the one exhibited in the left side of our

performance curves (see Section 4), albeit only for thread counts of up to a few dozen.

The same trends have been demonstrated later via other analytical models [70] [72].

These works, however, considered only a small thread count, and hence are not

applicable for machines that manage thousands of in-flight threads.

Hong and Kim [39] [40] and Baghsorkhi et al. [5] have proposed detailed analytical

models for GPU machines, which can be seen as applicable to our multithreading zone.

However, targeted at GPU architecture only, these models do not consider large caches,

and hence are not applicable for machines where caches are a key factor in determining

performance.

Several works have studied the performance of multi-threaded applications that

alternate between parallel and serial/critical section phases [26] [37] [58] [74], and show

that for these applications asymmetric machines are favorable. We focus on a different

domain of workloads, namely applications that are dominated by their parallel phase and

that can be parallelized into a large number of independent threads. We thus consider

only symmetric architectures.

Previous characterizations of the PARSEC benchmark suite [10] [11] [12] concentrate

on machines with significantly fewer cores than we do, and on parallelization of only up to 32

threads. We push multi-threading as well as the number of computation units to the

hundreds.


3. The Analytical Model

In order to study the basic tradeoffs of caching and multithreading over the range of

high performance engines and applications, we use a high-level, abstracted model that

can capture both mechanisms. This abstraction enables us to derive specific instances for

different machines from the same unified framework in a way that renders the

comparison meaningful. To enable elementary reasoning of the basic tradeoffs, we

purposely use a simple, first-order model. Indeed, the model can be augmented to account

for various additional effects (see the discussion in Chapter 4), but this should come

second to the basic tradeoff of caches vs. threads captured in this work.

While different architectures may differ in their programming model, we do not

consider programming issues here. Rather, we assume that the same applications can be

mapped to different engines across the range; frameworks like OpenCL [66] and Ocelot

[24] [25] are expected to allow for such cross-platform mappings. The different models,

however, are commonly described using different terminologies, which can be confusing.

Our terminology follows the one used in multi-threaded programming models like

CUDA [61], where a thread is a basic execution stream that processes a single data

element. A processing element (PE) is a processing unit that processes a single such

light-weight thread at a time; CUDA also uses the term Streaming Processor (SP) for a

PE. In programming models like Larrabee Native [70], each core executes a SIMD

instruction that processes several (e.g., 16) data elements at the same time. Thus, in our

terminology, a Larrabbe core is composed of 16 PEs, which can execute 16 threads at a

time. (Note that our notion of threads is different from traditional operating systems

threads; such light-weight threads are called strands in Larrabee Native.)

We next construct the model as follows: We first define the pertinent parameters of

the hardware and the workload characteristics (Section 3.1). We then derive performance

and power expressions for the abstracted machine (Section 3.2).

3.1. Hardware and Workload Model

Our abstracted machine includes an array of NPE processing elements and a large on-

chip shared cache of size S$. For simplicity, we only model the shared cache (L2/L3), and


consider local L1 caches, if they exist, to be part of the processing element. In addition,

the machine includes a register file for storing the contexts of up to Nmax in-flight threads;

we assume that this is the maximum number of threads running concurrently. We

consider simple, in-order PEs, for which the average number of cycles required to

execute an instruction is CPIexe (assuming a perfect, zero-latency memory system). We

assume that the machine is symmetric; hence all PEs run at the same frequency f. The on-

chip cache latency is t$, while the off-chip memory can be accessed at a latency of tm

cycles, and a bandwidth of BWmax, where each operand's size is breg. The hardware

parameters are summarized in Table 3-1.

Table 3-1. Hardware Parameters

Parameter Description

NPE Number of PEs (in-order processing elements)

S$ Cache size [Bytes]

Nmax Maximal number of thread contexts in the register file

CPIexe Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]

f Processor frequency [Hz]

t$ Cache latency [cycles]

tm Memory latency [cycles]

BWmax Maximal off-chip bandwidth [GB/sec]

breg Operands size [Bytes]

Clearly, the characteristics of the workload have a great impact on the attainable

performance. Recall that our focus is on data-parallel workloads which can be

parallelized to numerous independent threads. For benchmarks in this general family,


there are three key parameters that impact performance: (1) the scalability of the

workload, captured by the number of threads that can execute (or be ready to execute)

concurrently, n; (2) the compute intensity of the workload, captured by the ratio of

memory instructions out of the total number of instructions, rm; and (3) the locality of the

workload, captured by the thread cache hit rate function, Phit(s, n), where s is the cache

size and n is the number of threads that share the cache. Note that the latter captures the

hit rate in the shared (L2/L3) cache; a high hit-rate in the L1 cache, if such exists, is

manifested as a higher compute-to-memory ratio. The workload characteristics are

summarized in Table 3-2.

Table 3-2. Workload Parameters

Parameter Description

n Number of threads that execute or are in ready state (not blocked) concurrently

rm Fraction of instructions accessing memory out of the total number of instructions [0≤ rm≤ 1]

Phit(s, n) Cache hit rate for each thread, when n threads are using a cache of size s

3.2. Performance and Power Equations

We now use the parameters defined in Table 3-1 and Table 3-2 to analyze expected

performance. In this context, we make the simplifying assumption that the workload

parameters are fairly static, and do not vary much over time or space (i.e., at different

threads of the same application). We therefore use their average values in the equations

below. When validating our analysis using simulations (Section 5.3), we shall see that

this assumption holds for most of the benchmarks considered, with few exceptions.

Given each thread’s cache hit rate function and the cache and memory latencies as

defined in Section 3.1, we can compute the average number of cycles needed for data

access, denoted $t_{avg}$ ($t_{avg}$ is sometimes called the Average Memory Access Time, AMAT):

$t_{avg} = P_{hit}(S_{\$}, n) \cdot t_{\$} + \left(1 - P_{hit}(S_{\$}, n)\right) \cdot t_m$    (1)


Any given thread needs to stall once every 1/rm instructions on average, and wait until

the data it accesses is received from memory. During this stall time, the PE is left

unutilized, unless other threads are available to switch-in. The number of additional

threads needed in order to fill the PE's stall time is $\frac{r_m \cdot t_{avg}}{CPI_{exe}}$, and hence $N_{PE} \cdot \left(1 + \frac{r_m}{CPI_{exe}} \cdot t_{avg}\right)$ threads are needed in order to fully utilize the machine.

Given a workload with n ≤ Nmax threads, the processor utilization, 0 ≤ η ≤ 1, (the

average utilization of all PEs), is:

$\eta = \min\left\{1, \; \dfrac{n}{N_{PE} \cdot \left(1 + \frac{r_m}{CPI_{exe}} \cdot t_{avg}\right)}\right\}$    (2)

The minimum in Equation (2) captures the fact that after all execution units are

saturated, there is no gain in adding more threads to the pool. If we ignore bandwidth

limitations, the expected performance is simply $\eta \cdot N_{PE} \cdot \frac{f}{CPI_{exe}}$ OPS (Operations Per Second).

However, since bandwidth to external memories is limited, this performance level

cannot always be reached. In fact, off-chip bandwidth is a principal bottleneck that often

limits performance. For a given workload, a given number of threads, and given

performance (in OPS), the off-chip bandwidth generated can be expressed as:

$BW = Performance \cdot r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n)\right)$    (3)

Hence, given an off-chip bandwidth limit BWmax, the maximal performance achievable

by the machine is $BW_{max} / \left(r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n)\right)\right)$. Thus, performance can be expressed using the following equation:

$Performance\ [OPS] = \min\left\{\eta \cdot N_{PE} \cdot \dfrac{f}{CPI_{exe}},\ \ \dfrac{BW_{max}}{r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n)\right)}\right\}$    (4)


Table 3-3. Hardware Power Parameters

Parameter Description

eex Energy per operation [j]

e$ Energy per cache access [j]

emem Energy per memory access [j]

Powerleakage Leakage power [W]

With power and energy consumption becoming key factors in practically all modern

computer systems, performance under a given power envelope and power efficiency

become primary design targets. Power consumption can be modeled as

Powerleakage+Performance·EPI, where Performance is given by Equation (4), and EPI is

the average consumption of Energy Per Instruction. Using the notations in Table 3-3, the

power consumed can be expressed using the following equation:

( )( )$$ $( ) 1 ( ), ,leakage ex m hit hit mem

Power Power Performance e r P S e P S en n= + ⋅ + ⋅ ⋅ + − ⋅ (5)

Notice that as the number of concurrent threads grows, the hit rate for each of them

degrades, and thus more accesses are served from memory. Since memory access is

significantly more energy-costly than access to the on-die cache, the EPI increases with

the number of threads. This effect is more significant for architectures which achieve

their performance via a very high thread count.
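
To make the model concrete, the following sketch evaluates Equations (1)-(5) directly; the parameter names mirror Tables 3-1 through 3-3, the hit-rate function is passed in as a callable, and the dictionary-based interface is purely illustrative.

    def performance_and_power(n, hw, wl, energy):
        # hw: dict with N_PE, S_cache, N_max, CPI_exe, f, t_cache, t_mem, BW_max, b_reg.
        # wl: dict with r_m and a hit-rate callable P_hit(s, n).
        # energy: dict with e_ex, e_cache, e_mem, P_leak.
        p_hit = wl["P_hit"](hw["S_cache"], n)

        # Equation (1): average memory access time.
        t_avg = p_hit * hw["t_cache"] + (1.0 - p_hit) * hw["t_mem"]

        # Equation (2): PE utilization; the thread count is capped at the supported N_max.
        n_eff = min(n, hw["N_max"])
        eta = min(1.0, n_eff / (hw["N_PE"] * (1.0 + (wl["r_m"] / hw["CPI_exe"]) * t_avg)))

        # Equations (3)-(4): compute-limited vs. bandwidth-limited performance (operations/sec).
        compute_limited = eta * hw["N_PE"] * hw["f"] / hw["CPI_exe"]
        if p_hit < 1.0:
            bw_limited = hw["BW_max"] / (wl["r_m"] * hw["b_reg"] * (1.0 - p_hit))
        else:
            bw_limited = float("inf")
        perf = min(compute_limited, bw_limited)

        # Equation (5): power = leakage + performance * energy per instruction.
        epi = energy["e_ex"] + wl["r_m"] * (p_hit * energy["e_cache"]
                                            + (1.0 - p_hit) * energy["e_mem"])
        power = energy["P_leak"] + perf * epi
        return perf, power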

4. Performance and Power Curve Study

In this section we study how various workload characteristics affect the performance

(Section 4.1) and power (Section 4.2) curves. For this study, we use an example machine

consisting of 1024 PEs and a 16MB cache. The machine supports up to Nmax = 65536 in-

flight threads and runs at a frequency of 1GHz with a CPIexe of 1 cycle. The machine

requires 1 cycle to access its on-chip cache and 200 cycles to access off-chip memory,

whose bandwidth is 200GB/sec. We assume single precision calculation (i.e., an operand

size of 4 bytes). The machine's parameters are summarized in Table 3-4.


Table 3-4. Machine Parameters

Parameter Value

Number of PEs (NPE) 1024

Data Cache Size (S$) 16MByte

Number Of Threads (Nmax) 64K

Frequency (f) 1 GHz

CPIexe 1 cycle

Cache latency (t$) 1 cycle

Memory latency (tm) 200 cycles

Max off-chip Bandwidth (BWmax) 200GB/s

Size of operands (breg) 4 bytes

We begin with synthetic workloads to enable a clean study of trends and the effect of

different parameters on the performance plot. These will be replaced with real workloads

in Section 5. We use the following simple cache hit rate function [45]:

$P_{hit}(S_{\$}, n) = 1 - \left(1 + \dfrac{S_{\$}}{\beta \cdot n}\right)^{-(\alpha - 1)}$    (6)

This function is based upon the well known empirical power law from the 70’s (also

known as the 30% rule or the √2 rule) [18]. In Equation (6), workload locality increases

when increasing α or decreasing β. The parameter β also accounts for the degree of

sharing among the threads: in case much of the cache is shared, each thread can utilize a

larger portion of the cache, which is represented by a small value of β.
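
A small sketch of Equation (6), combined with the performance_and_power() sketch from Section 3.2, makes it possible to sweep the thread count n on the Table 3-4 machine and examine the resulting performance and power, as done in the following subsections. The α, β, and energy values below are illustrative choices, not parameters fixed by the thesis.

    def phit_powerlaw(alpha, beta):
        # Equation (6): per-thread hit rate when n threads share a cache of s bytes.
        def P_hit(s, n):
            return 1.0 - (1.0 + s / (beta * n)) ** (-(alpha - 1.0))
        return P_hit

    # Table 3-4 machine; the energy numbers are illustrative placeholders.
    hw = dict(N_PE=1024, S_cache=16 * 2**20, N_max=64 * 1024, CPI_exe=1.0,
              f=1e9, t_cache=1, t_mem=200, BW_max=200e9, b_reg=4)
    wl = dict(r_m=0.1, P_hit=phit_powerlaw(alpha=2.0, beta=50.0))
    energy = dict(e_ex=1e-10, e_cache=1e-10, e_mem=2e-9, P_leak=10.0)

    for n in (256, 1024, 4096, 16384, 65536):
        perf, power = performance_and_power(n, hw, wl, energy)   # sketch from Section 3.2
        print(f"n={n:6d}  GOPS={perf / 1e9:7.1f}  Watts={power:6.1f}")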

4.1. Performance Curve Study

4.1.1. Parameters Sensitivity

Figure 3-1 shows the performance vs. the number of threads, n, available in the

workload, for synthetic benchmarks with different cache hit rate functions (i.e., data

locality).


Figure 3-1. Performance vs. number of threads for benchmarks with different cache hit rate

functions (Phit(s,n)), increasing from 0 (no cache) to 1 (a perfect cache), rm=0.1. Performance

increases with increased locality, especially in the cache efficiency zone.

In Figure 3-1, three performance regions are clearly evident: In the leftmost region, as

long as the cache capacity can effectively serve the growing number of threads,

increasing the number of threads improves performance, as more PEs are utilized. This is

the cache-efficiency zone. At some point, the cache becomes too small for the growing

stream of access requests, so memory latency is no longer masked by the cache and

performance improves more moderately, or even takes a dip into a valley. As the number

of available threads again increases, the multithread efficiency zone (on the right) is

reached, where adding more threads improves performance up to the maximal

performance of the machine, or up to the bandwidth wall. In Section 4.2 we show that

power considerations also limit the achievable peak. Only scalable workloads with a high

enough number of independent threads can benefit from this region.

Figure 3-1 shows that workloads with higher locality better exploit the cache and

hence expand the cache efficiency zone to the right and up. Workloads with poor locality

cannot utilize the cache and hence gain performance only from an increase in their thread-level

parallelism. Moreover, the shape of the performance curve depends on how fast the

cache hit rate degrades as a function of the number of threads: The valley occurs

whenever the degradation in the cache hit rate is of the form n^{1+ε} for some positive ε,


representing a super-linear dependency of the hit-rate degradation on the number of

threads. (This degradation rate can be computed by differentiating the performance formula

(Equation 4) with respect to n.) We see that when this condition is not met (e.g., the

dark-gray solid curve; α=3.5, β=13), there is no valley between the two regions.

Another point to notice in Figure 3-1 is that, once the bandwidth requirements exceed

the capacity of the physical link, the climb stops, and performance actually starts to

degrade. To explain why this happens, recall that Phit is affected by the number of

threads— the more in-flight threads there are, the less cache is available to each of them.

Therefore, when the off-chip bandwidth wall is met, adding more threads only degrades

performance due to increasing off-chip pressure.

Figure 3-2 shows how the compute intensity of the workload affects the shape of the

performance plot. When there are more computation instructions per memory access, (a

smaller rm), performance climbs more steeply with additional threads. This is because as

more instructions are available for each memory access, fewer threads are needed to fill

the stall time resulting from waiting for memory. Thus, compute-intense applications can

reach peak performance with less parallelism and smaller bandwidth requirements. All in

all, a high compute/memory ratio decreases the need both for caches and for scaling the

application to many threads.

Figure 3-2. Performance in a limited BW environment for benchmarks with different

percentages of memory instructions (rm), α=7, β=50. Performance increases with the

compute intensity, i.e., as rm decreases.


4.1.2. Workload Families

Looking at the curves of Section 4.1.1, we observe that, by-and-large, workloads exhibit three different behavior patterns depending on their parameters. Figure 3-3 schematically plots these three examples (stopping before the bandwidth saturation point). Simulation results (Section 5) will later validate that all three classes of behavior can be found in "real" workloads.

Figure 3-3. Performance curve for the 3 types of workloads: (A) Workloads with a constant hit-rate in the cache exhibit linearly increasing performance (the slope depends on the hit rate), (B) workloads exhibiting a nonlinear but monotonically increasing performance (their hit-rate is mildly reduced as more and more threads share the cache), (C) workloads exhibiting a performance valley (their hit rate is sharply reduced as the cache is shared by more and more threads).

The performance of workloads of class A (dashed line) grows linearly with the number of threads. These workloads have a constant cache hit rate which is independent of the number of threads. This may happen, for example, due to poor locality (light-gray dashed curve in Figure 3-1, cache hit rate = 0%) in one extreme case, or due to full sharing of data among all threads (assuming all data is cached, black solid curve in Figure 3-1) in the other extreme case. Workloads of this class can efficiently use either classic multithreading-based or classic cache-based architectures, depending on their (constant) cache hit-rate.


The other two workload classes present non-linear behavior. Both have an operation

zone where the cache is more effective and an operation zone where multithreading is

more effective, but they differ in the area between these two zones. Workloads of class B

(solid line) exhibit a monotonically increasing performance. They are characterized by a

sub-linear degradation of their cache hit rate function in the number of threads. The

transition from the cache effective zone to the multi-threading effective zone does not

incur a performance loss but rather a reduced rate of performance improvement.

Workloads of this class will perform better with aggressive multi-threading - at least in an

unrestricted environment where no memory or bandwidth constraints exist.

Workloads of class C (dotted line) present a valley-like behavior, and are

characterized by a super-linear degradation of the hit rate in the number of threads.

Optimizing the architecture for such workloads is especially challenging, because by

trying to leverage a combination of the two approaches, these workloads might end up in

a performance zone inferior to their achievable peaks either in cache-only or

multithreaded-only architectures.

Lastly, note that the x axis, n, represents the number of threads that can actually run at

a time, and does not include ones that are blocked, either on I/O or on synchronization. In

workloads with extensive synchronization or I/O activity, n will be limited, so the plotted

performance curve will be pruned somewhere along the x axis. Likewise, recall that the

achievable peak in the multithread zone is limited by the maximal bandwidth (as seen in

Figure 3-1) and, as we show next (in Section 4.2), by the machine's power envelope.

4.2. Power Efficiency and Performance under a Power Envelope

In the following study of power costs, we take as an example an energy per operation

(eex) of 0.1nJ, and factors of 5 and 50 for accesses to cache (e$) and memory (emem),

respectively. We note that this is only one such example; using the analytical model,

other ratios can be plugged in to derive results.
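As a concrete illustration of Equation (5) with these numbers (the r_m and hit-rate values below are hypothetical, chosen only for illustration): with e_ex = 0.1 nJ, e_$ = 0.5 nJ, and e_mem = 5 nJ, a workload with r_m = 0.1 and a hit rate of 0.9 spends EPI = 0.1 + 0.1·(0.9·0.5 + 0.1·5) = 0.195 nJ per instruction, whereas the same workload at a hit rate of 0.5 spends 0.1 + 0.1·(0.5·0.5 + 0.5·5) = 0.375 nJ, almost twice as much. This is the mechanism by which additional threads, by lowering the per-thread hit rate, inflate the EPI.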


Figure 3-4. Efficiency metrics: (a) Performance/Power (1/EPI) and (b) Performance²/Power (1/(energy·delay)) of the unified machine, α=7, β=50. The former is always better in the cache efficiency (left-hand) zone, whereas for the latter, the favorable archetype varies according to the workload characteristics.


The performance versus power tradeoff is typically studied using one of the following

two efficiency metrics: the normalized power consumption per instruction, which is

captured by Performance/Power, or the energy·delay product, which is captured by

Performance²/Power. Figure 3-4 presents these two metrics for the benchmarks of Figure

3-2. We see that in terms of the Performance/Power metric (Figure 3-4(a)), using caches is

always preferable as they enable serving most of the data accesses from the cache rather

than memory. For the more performance-oriented metric (Figure 3-4(b)), some of the

workloads favor caching, whereas others favor multi-threading. For example, for the

dark-gray dashed curve (rm=0.2), the performance boost via multithreading comes at too

high a power toll, and hence caching is the preferable approach in this case. For the light-

gray solid curve (rm=0.05), the performance boost from multithreading exceeds the

increase in power consumption, and hence multithreading is the favorable approach.

The results above were obtained assuming that power consumption is not constrained.

In practice, however, machines have a limited power envelope under which they need to

function. Figure 3-5 assumes a power envelope of 300 Watts, and revisits the

performance curve of Figure 3-2 under this constraint.

Figure 3-5. Performance under a power envelope of 300 Watts, α=7, β=50. Power

constraints limit performance achievable via the use of multithreading since it increases the

frequency of accesses to main memory, which are more expensive in terms of power.


Recall that increasing the number of threads increases the energy spent on each

instruction. Thus, the number of threads that can be run within a given power envelope is

bounded, and the achievable performance in a limited power environment is smaller than

the theoretical (unconstrained) peak performance of the machine.
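Rearranging Equation (5) makes this cap explicit: under an envelope Power_max, throughput is bounded by Performance ≤ (Power_max − Power_leakage)/EPI. As a hypothetical illustration (the leakage value is an assumption, not a figure used elsewhere in this chapter): with 30 W of leakage and the 0.375 nJ EPI from the low-hit-rate example above, a 300 W envelope allows at most (300 − 30) W / 0.375 nJ ≈ 720·10⁹ instructions per second, regardless of how many threads are available.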

In Section 3.2 we saw that for highly parallel workloads, the best performance can be

achieved by a high thread count, which provided the motivation for aggressively multi-

threaded machines with simple cores and no caches. Nevertheless, we now see that this

approach is limited by power constraints, and that caches can help reduce the energy

costs per instruction, especially for workloads with good locality properties. This is part

of the reason why new GPU engines incorporate more on-chip memory than in the past.

5. Performance Simulation

In this section we back our analytical model by simulation. First, we describe our in-

house simulator (Section 5.1), and detail the workloads system parameters (Section 5.2).

In Section 5.3 we present simulation results, and in Section 5.4 we study how different

hardware parameters affect the shape of the performance valley, using simulation of a

real-world workload as a test-case.

5.1. Simulator

We use an in-house simulator, MTM$im, specifically designed for simulating

graphics-oriented architectures with numerous cores. MTM$im uses the Pin binary

instrumentation tool [53]. It models a highly multi-threaded machine with a

parameterized number of cores and a large shared cache (of parameterized size). Each

core is a simple processing element, such that every instruction takes a predefined

number of cycles to execute unless it accesses the memory hierarchy. MTM$im

implements a shared last level cache and parameters define the latencies for memory

access in cases of hit or miss. The simulated machine maintains a large number of thread

contexts (typically significantly larger than the number of cores), such that when one

thread is blocked on memory access (i.e., cache miss), another thread is swapped in and


utilizes the machine. We use a simple round-robin scheduling policy among all available

threads.
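The following toy Python sketch illustrates the swap-on-miss, round-robin scheduling policy described above for a single processing element. It is a didactic sketch only, not the MTM$im implementation; the function, the event format, and the parameters are invented for the example.

from collections import deque

def simulate_single_pe(threads, t_cache=1, t_mem=200):
    """Toy timing loop for one PE with swap-on-miss multithreading.
    Each thread is an iterator yielding ('compute', cycles) or ('mem', hit)
    events; a thread keeps the PE until it misses in the cache, at which point
    it is parked until the miss resolves and the next ready thread is swapped in."""
    ready = deque(threads)
    blocked = []                                   # (wakeup_cycle, thread) pairs
    cycle = 0
    while ready or blocked:
        # Return threads whose memory access has completed to the ready queue.
        still_blocked = []
        for wakeup, th in blocked:
            if wakeup <= cycle:
                ready.append(th)
            else:
                still_blocked.append((wakeup, th))
        blocked = still_blocked
        if not ready:                              # every thread is waiting on memory
            cycle = min(w for w, _ in blocked)     # fast-forward to the next wakeup
            continue
        th = ready.popleft()                       # round-robin pick of a ready thread
        for kind, arg in th:                       # run it until a miss or completion
            if kind == 'compute':
                cycle += arg
            elif arg:                              # memory access that hits in the cache
                cycle += t_cache
            else:                                  # cache miss: park the thread, swap out
                blocked.append((cycle + t_mem, th))
                break
    return cycle

def toy_thread(iterations, hit_pattern):
    """A synthetic thread: alternating compute bursts and memory accesses."""
    for i in range(iterations):
        yield ('compute', 3)
        yield ('mem', hit_pattern[i % len(hit_pattern)])

if __name__ == "__main__":
    # Eight identical threads, each with a 90% hit rate, on a single PE.
    print(simulate_single_pe([toy_thread(100, [True] * 9 + [False]) for _ in range(8)]))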

MTM$im decouples timing/scheduling of each instruction from its functional

implementation. The simulated machine engine determines when every instruction of

every thread is executed via its timing model, and uses Pin only to determine what each

instruction does. Since the timing simulator determines when each thread advances, this

approach captures dynamic and transient time-dependent effects such as fine-grained data

sharing, thread synchronization, and in general, inter-thread communication and inter-

thread interactions. It further exposes dynamic variability within the benchmark, which is

not captured by our analytical model. CMPSched$im [59] and GEMS [56] both take the

same approach of decoupling simulation functionality and timing, the former using Pin as

its functional model and the latter using Simics [54].

MTM$im was preferred over Simics because we are interested in the application alone

rather than the entire operating system stack. Moreover, since thread scheduling is done

within the machine and not via the operating system, it was important to filter out code

related to the operating system itself. Lastly, MTM$im also enables very fast

simulation, running at about 10MIPS compared to the roughly 10KIPS achievable in

Simics in-order mode.

5.2. Workload and System Parameters

We study workloads from the PARSEC benchmark kit [12], concentrating on those

with relatively high scalability: blackscholes, raytrace, swaptions, canneal, dedup, and

bodytrack. (Since we are interested in pushing the number of threads to hundreds, we

leave out benchmarks from the kit that either have very limited scalability, or that cannot

be spawned with hundreds of threads [10] [12].) We chose the PARSEC kit because it

represents emerging workloads, specifically modeling future CMP applications [11]. The

PARSEC kit is more diverse than traditional parallel workload suites, which focused

more narrowly on high-performance computing. We run all workloads using the

simmedium inputs, and vary the number of instantiated threads from 1 up to around 2000

or to the point where adding threads is no longer effective. (We have limited the number


of threads to 2000 due to the excessive memory requirements of the Pin tool when

simulating such a large number of threads.) Note that the actual number of available

threads, n, is often lower than the number of instantiated threads, because threads are

sometimes blocked due to synchronization or system calls. For all workloads and all runs,

we fast forward through the initialization phase of the benchmark, and sample only the

parallel phase where all threads have been spawned.

Note that even scalability to hundreds of threads falls short of the futuristic values

used in the synthetic model above. To capture the performance trends within the much

smaller simulated range, we scale down the machine parameters, to 128 PEs and a 4MB

shared cache. Cache hit time is 1 cycle and memory latency is 200 cycles. The number of

in-flight thread contexts is set to 10000 so that it is not a limiting factor.

5.3. Results

Table 3-5. Memory instruction ratios (rm) for PARSEC's workloads

Benchmark    Percent of Memory Instructions (100·rm)

blackscholes 32.72

swaptions 43.26

raytrace 64.19

dedup 36.55

canneal 34.23

bodytrack 29.33

In order to validate our analytical study, we extract the actual average workload

parameters (effective number of threads n, hit rate Phit, and compute/memory ratio rm).

We then substitute these values in the equations of the analytical model (Section 3.2), and

compare the performance values predicted by the model to the performance values

measured by the simulator. Figure 3-6 shows the simulated performance results and the

performance predicted by the analytical model for the different workloads. For each

simulation run, we plot a data point whose x coordinate reflects the average number of

available threads (running or ready) over all times in the run, and the y coordinate reflects

performance in the same run. Figure 3-6 also shows the average cache hit rate extracted


from simulations for each application, as a function of the number of threads. The ratio

of memory instructions, rm, for each of the benchmarks is given in Table 3-5. We found

that for all considered workloads, rm is practically the same for any number of threads the

workload is parallelized to.

Figure 3-6. Performance vs. number of available threads as extracted from simulation and

as predicted by the analytical model with actual workload parameter values. The figures

also show the average hit rate for each workload. We see that when instantiated with

benchmark-specific parameter values, the analytical model provides a very accurate

prediction of performance for most benchmarks; this implies that benchmark parameters


vary very little both in space and in time during their parallel phase. One exception is

bodytrack, where the number of active threads varies greatly in time, and hence our

analysis, although still predicting the general trend, does not closely match the simulation

results. We further observe that in embarrassingly parallel workloads like blackscholes and

swaptions (financial workloads), the valley shape is clearly exhibited, with a steep climb in

the former and a mild climb in the latter. Other workloads, like canneal (simulated

annealing) and dedup (compression), are not sufficiently scalable to climb out of the valley

and benefit from the MT zone. Lastly, workloads like raytrace do not descend into a valley,

and present better performance as more threads are spawned.

We observe that the analytical model predicts performance accurately for our target

applications, namely, symmetric, parallel workloads. This essentially shows that we can

represent a workload very accurately using three numbers – n, Phit, and rm, and that our

analytical model, despite being simple, effectively captures the interplay of these three

parameters with the architecture. Moreover, the close correspondence between the model

and the simulations also shows that using average values of hit rate and compute-to-

memory ratio is a reasonable approximation in most cases. For asymmetric workloads

like bodytrack, where the workload parameters vary in space (i.e., at different threads)

and in time, our analysis is less accurate. We note that a more detailed simulator might

well amplify the deviation of the model prediction from the simulation results.

Nevertheless, such deviations would not undermine our qualitative study, which does not

seek to obtain quantitative expected performance numbers on any given hardware.

We further see that the PARSEC workloads span the range of behaviors depicted in

Figure 3-3. Some workloads (blackscholes, swaptions - financial workloads) exhibit a

valley-like shape. Indeed, as can also be seen in Figure 3-6, both incur super-linear

degradations in their hit rates as the number of threads increases. The two workloads,

however, differ in the gradient of their performance growth in the multithreading

effective zone: being more compute-intensive, blackscholes climbs faster than swaptions,

gaining more performance out of every additional thread. The higher compute intensity

and better locality allow blackscholes to also reach a higher peak in the cache zone

compared to swaptions. In fact, blackscholes fully utilizes all execution units of the

simulated machine in the cache effective zone (hence the plateau in its peak).


On the other hand, raytrace (a graphics workload) does not exhibit a valley. Indeed, its

performance continues to increase with every additional thread. This is because, as Figure

3-6 also shows, its cache hit rate degrades sub-linearly, thanks to extensive data sharing

among threads.

Other workloads, like canneal (simulated annealing) and dedup (compression

workload) can only operate in the cache efficiency zone due to their limited scalability.

Despite spawning numerous threads when running these benchmarks, we could not get

more than 250-300 of their threads to run (or be ready to run) concurrently. Notice that

this limited scalability is not due to the overhead of the synchronization primitives

themselves, but rather due to true dependencies among threads, which block waiting for

each other.

5.4. Performance as a Function of Hardware Parameters

While in Section 4.1 we study how workload parameters affect the shape of the

performance curve, in this section we study how a given workload behaves across

different machines. We take blackscholes as a real example and use its actual characteristics

(i.e., rm and P(S$, n)) as extracted from simulations.

Figure 3-7 presents the behavior of blackscholes for machines with different cache

sizes. Naturally, larger caches are able to extend the cache efficiency zone up and to the

right. Less intuitive is the fact that caches are also crucial in the multithreading zone.

Notice that for larger caches, the peak achievable in that region is higher. This is because

the achievable performance via the use of multithreading is limited by the machine

bandwidth, and hence larger caches are able to better reduce the pressure on the external

memory and therefore postpone the point of bandwidth saturation.

Figure 3-8 presents the behavior of blackscholes over machines with different memory

latencies. We see that longer latencies reduce the rate of the climb in the multithreading

zone, since more threads are needed to mask the longer latencies to memory. Since the

memory latency wall is only getting worse, caches will become even more critical in the

future, as gaining performance out of multithreading will become increasingly hard.


Figure 3-7. Performance of the blackscholes workload across machines with different cache

sizes.

Figure 3-8. Performance of the blackscholes workload across machines with different

memory latencies.

6. Summary

This work sought to shed some new light on the two fundamental approaches to

achieving high-performance in the multi-core era, namely caching and aggressive multi-

threading. To this end, we presented a simple closed-form model, validated by

simulations. To the best of our knowledge, ours is the first analytical model to account

for both memory-masking techniques. As such, our model captures current architectures


that employ either one of the approaches, as well as novel high performance engines, like

Nvidia's Fermi, which leverage both.

Our model facilitates reasoning about complex phenomena that arise when both

approaches are in play. We used it in a qualitative study of representative workloads on

characteristic high-performance architectures. We observed that as the number of threads

scales up, different benchmarks exhibit very different performance curves. In some cases,

perhaps counter-intuitively, performance is not monotonic with the number of threads.

Finally, we believe that our model can direct further research on ways to address the

memory wall problem in high-performance engines. We discuss this at length in Chapter

4.


Chapter 4.

Summary and Future Work

In this chapter we summarize our work and contributions, as well as propose future

research directions.

1. Summary

In this dissertation we have tackled several aspects related to the interplay between

caches, multi-threading, and parallel workloads in CMP systems. With multi-core and

many-core becoming ubiquitous, this interplay plays a primary role in overcoming one of

the key challenges of the CMP era - mitigating the memory wall.

In Chapter 2 we presented Nahalal - a novel cache architecture for LLC in multi-core

systems. The Nahalal architecture addresses the remoteness of shared data – a

phenomenon experienced by multithreaded applications running on multi-core machines.

Via characterization of typical multithreaded workloads we revealed that many of the

accesses to the cache in such workloads target a relatively small number of shared lines,

which are shared among many of the threads in the programs. We showed that this

phenomenon, which we dub the shared-hot-lines effect, severely hampers the cache

performance as shared lines are bound to reside far away from at least some of their

sharers, incurring long access times.

The Nahalal floorplan topology partitions cached data according to its usage – shared

data versus private data – and thus enables fast access to shared data by all the cores

while preserving the vicinity of private data to the core that uses it. In Nahalal, a fraction

of the LLC memory capacity budget is used for hot shared data, and is located in the

center of the chip, enclosed by all processors. The rest of the LLC memory is placed in

the outer area of the die, and provides private storage space for each core. Nahalal thus

both mitigates the negative impact of the shared-hot-lines effect on

CMP performance and leverages the same effect in devising its architectural

solution. This is but one example where the interplay between multithreaded application


and caches in CMP systems introduces new phenomena not present in uniprocessors, and

also opens the door for new types of solutions.

We have demonstrated the potential of the Nahalal concept via detailed simulations in

Simics of two CMP design examples, where Nahalal decreased cache access latency by

up to 41.1% compared to the traditional CMP designs, yielding performance gains of up

to 12.65% in run time. Our simulations thus demonstrate the opportunity in new

approaches to cache design in which cache architecture, organization, and management,

are all specifically tailored to the true multiprocessing environment of CMP.

In Chapter 3 we turned to consider a broader view and studied the interplay of threads

and caches in many-core machines via a conceptual, high-level analytical model. Our

work provides a new model that captures the behavior of parallel workloads on different

throughput-oriented many-core engines from across the range: from cache-based

machines, to MT-based machines, to machines in between that combine the two

approaches to different extents. Specifically, we provide a simple closed-form model

which, for a given application, describes its performance and power as a function of the

number of threads it runs in parallel, on a range of architectures.

We have used our analytical model to qualitatively study the tradeoffs between these

two basic approaches for mitigating the memory wall– caching and multi-threading– in

the context of the new generation of many-core engines. We studied how different

properties of both the workload and the architecture affect performance and power, and

recognized distinctly different behavior patterns for different application families and

architectures. Moreover, we discovered a non-intuitive behavior exhibited by a family of

workloads which suffer from an intermediate operation zone in which the machine

delivers inferior performance. (We dubbed this phenomenon the performance valley.)

To this end, we consider our model as a vehicle to derive intuition, and as a tool that

enables systematically reasoning about the complex phenomena involved in the interplay

between caches, multi-threading, and parallel workloads in the new generation of

throughput-oriented many-core machines.


2. Future Research Directions

Throughout this dissertation we strived to emphasize the insights gained with regard to

cache design in CMP systems and the intimate interplay between caches, multithreading

and parallel workloads in modern CMP machines. We believe that these insights can be

put into use in many possible extensions that naturally follow our main line of work. In

the following subsections, we outline several such possible research directions.

2.1. Accounting for Energy in the Nahalal Architecture

In Chapter 2, the Nahalal architecture was discussed only in the context of latency

reduction: Nahalal's topology reduces the distance to shared data and thus shortens its

fetch time. But reducing the distance between the data and its clients also reduces the

energy spent for data transfer, which becomes a dominant factor in the total energy spent

per execution [19] [20]. An interesting extension is to also account for energy

consumption when comparing Nahalal with previous CIM structures, which will further

emphasize Nahalal's advantage.

2.2. Nahalal Scalability

The Nahalal architecture was presented in the context of multi-core CMPs. Despite its

merits, it is important to note that like the traditional cache-in-the-middle layout, Nahalal

also has limited scalability. This is because the middle area grows as the square of the

number of cores around the circumference: the number of cores that can enclose the center

scales with the perimeter, i.e., linearly with the radius, while the central area scales with the

radius squared. While Nahalal's design is feasible for a

moderate number of cores, massively scalable CMPs (i.e., many-core) employing

hundreds of cores clearly require a different topology. In such systems, one appropriate

solution might employ a clustered design, whereby the cores and memory banks are

organized as a collection of closely knit clusters (Figure 4-1b).

This layout, like the layout of the Nahalal architecture itself, is also inspired by urban

design ideas from the 19th century [41], where clusters of several garden cities are linked

together. (See Figure 4-1a.) Designing and evaluating such scalable architectures is an

interesting direction for future work.


(a) A cluster of Garden-Cities [41]. (b) Clustered Nahalal CMP design.

Figure 4-1. A clustered design for many-core CMP, where each cluster is organized in the

form of Nahalal.

2.3. Extending Nahalal's Principles into New Cache Dimensions

The Nahalal architecture challenges the ways in which caches are organized.

Traditionally, uniprocessor microarchitectures treated all data the same, partitioning the

cache almost solely hierarchically (e.g., cache levels: L1, L2, etc.). We argue that with

CMP architectures, additional cache dimensions will prove valuable, for example, based

on data coherency, different data usage patterns, and other CMP characteristics.

In this work we have developed the CMP data sharing paradigm: Nahalal

differentiates cached data according to its usage – private vs. shared – and treats each of

these types differently. Further research can use the same principles and apply them to

different usage dimensions. Some examples for future study may include partitioning

caches according to the required coherence levels, transactional versus non-transactional

memory, special-purpose areas for semaphores, different areas for different data usage

patterns, etc.

We note that these future research directions can be seen as but one thread in the

overall trend towards architecture asymmetry in CMPs, which is mainly considered in the

context of making the processors asymmetric [37] [50] [51] [58].


2.4. Enriching the Many-Core Analytical Model and the MTM$im Simulator

The high-level analytical model discussed in Chapter 3 is targeted at studying the

basic tradeoffs between threads and caches in many-core engines. It opts for simplicity in

order to provide a model that is easy to comprehend and that exhibits a clean view of the

trends. We therefore purposely used a simple, first-order model, and neglected aspects of

the architecture which we believe to be of second order.

Nevertheless, a natural extension of our work will gradually add these various aspects

into the model and will study their effect on the resulting performance. For example, of

particular interest is accounting for more complex cache hierarchies and cache coherency

protocols.

Moreover, while accounting for more complex hardware structures might turn out to

be too complicated for the analytical model, the MTM$im simulator can relatively easily

be extended to support them. Cache hierarchies, for example, are a natural extension to

the simulator, and will enable studying the behavior of real workloads under more

realistic systems. Such a study is expected to provide further insights into the behavior of

these systems and the workloads that are expected to run on them.

We also encourage broadening the scope of applications that are studied with the aid

of the MTM$im simulator. While in Chapter 3 we have considered only sample

workloads from the PARSEC kit as an example, we believe that studying different

workloads, and especially CUDA-based workloads [62], will provide valuable insights.

CUDA-based workloads are skewed towards MT-based machines and present

better scalability properties than those of PARSEC. They are hence an important class of

throughput-oriented applications worth studying.

2.5. Eliminating the Performance Valley via Dynamic Morphing

In Chapter 3, we identified different behavior patterns for different workload classes,

and showed that different workloads have different preferences in regard to the machine

most favorable for them. For example, some workloads will gain nothing from using a

cache and are better off spawning as many threads as possible, while others can exploit

the cache quite effectively and would achieve high performance as long as we avoid


spawning too many threads that cause heavy cache thrashing. In general, the preference

towards more on-die storage or more threads can vary not only among different

workloads, but also between different phases of the same program, and for different

thread count ranges within the same application.

We believe that in order to provide optimal performance for a wide range of

workloads, the machine must dynamically tune itself to each workload, morphing

between a cache-based machine and a thread-based machine according to indications

gathered dynamically by the hardware. Conceptually, this approach will allow the

machine to eliminate the valley, as depicted in Figure 4-2. Moreover, as future

throughput-oriented engines will combine both techniques to cope with the memory wall

(both caching and multithreading), balancing between the two for each workload is a

must in order to ensure that the machine will not end up running in the valley.

Figure 4-2. A conceptual performance plot of a unified machine enhanced with dynamic

morphing. The morphed engine is able to escape the valley.

In terms of the needed hardware, we envision a unified on-die storage that is used for

both data caching and thread context storage, where the machine decides how to divide

the storage between these two functions according to the concrete workload

characteristics. One possible implementation of this unified on-die storage would enhance

the regular cache to also store all thread contexts, and would prefetch relevant contexts to

the engines in a timely fashion. As for the required indications, the dynamic morphing

scheme fits within the general trend of providing various performance counters by the


hardware. We find this extension to our research an exciting and very promising research

direction.


References

[1] A. Agarwal, “Performance tradeoffs in multithreaded processors,” IEEE

Transactions on Parallel and Distributed Systems, 1992.

[2] V. Agarwal, M. S. Hrishikesh, S.W. Keckler, and D. Burger. "Clock rate vs. IPC:

The end of the road for conventional microprocessors," in Proceedings of the 27th

Annual International Symposium on Computer Architecture (ISCA), pages 248–

259, June 2000.

[3] V. Aslot, M. Domeika, R. Eigenmann, G. Gaertner, W. Jones, and B. Parady.

"SPEComp: A new benchmark suite for measuring parallel computer performance,"

in Workshop on OpenMP Applications and Tools, pages 1–10, July 2001.

[4] ATI Mobility RadeonTM HD4850/4870 Graphics-Overview,

http://ati.amd.com/products/radeonhd4800.

[5] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. –M. Hwu, "An

adaptive performance modeling tool for GPU architectures," in Proceedings of the

15th ACM SIGPLAN symposium on Principles and Practice of Parallel

Programming (PPoPP), January 2010.

[6] P. Barford and M. Crovella, "Generating representative web workloads for network

and server performance evaluation," in Measurement and Modeling of Computer

Systems, pages 151–160, June 1998.

[7] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano,

S. Smith, R. Stets, and B. Verghese, "Piranha: A scalable architecture based on

single-chip multiprocessing," in Proceedings of the 27th Annual International

Symposium on Computer Architecture (ISCA), pages 282–293, June 2000.

[8] B. M. Beckmann, M. R. Marty, and D. A. Wood, “ASR: Adaptive selective

replication for CMP caches,” in Proceedings of the 39th International Symposium

on Microarchitecture (MICRO), December 2006.


[9] B. M. Beckmann and D. A. Wood, "Managing wire delay in large chip

multiprocessor caches," in Proceedings of the 37th International Symposium on

Microarchitecture (MICRO), pages 319-330, December. 2004.

[10] M. Bhadauria, V. M. Weaver, and S. A. McKee, "Understanding PARSEC

performance on contemporary CMPs," in Proceedings of the IEEE International

Symposium on Workload Characterization (IISWC), October 2009.

[11] C. Bienia, S. Kumar, and K. Li, “PARSEC vs. SPLASH-2: A quantitative

comparison of two multithreaded benchmark suites on Chip-Multiprocessors,” in

Proceedings of the IEEE International Symposium on Workload Characterization

(IISWC), September 2008.

[12] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite:

characterization and architectural implications,” in Proceedings of the 17th

International Conference on Parallel Architectures and Compilation Techniques

(PACT), October 2008.

[13] E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, "Cost considerations in network

on chip", Integration- The VLSI Journal, special issue on Network on Chip,

Volume 38, Issue 1, October 2004, pp. 19-42.

[14] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny, "The power of priority:

NoC based distributed cache coherency", in the 1st ACM/IEEE International

Symposium on Networks-on-Chip (NOCS), May 2007.

[15] J. Brown, R. Kumar, and D. Tullsen. "Proximity-aware directory-based coherence

for multi-core processor architectures", in the 19th ACM Symposium on Parallelism

in Algorithms and Architectures (SPAA), June 2007.

[16] J. Chang and G. S. Sohi. “Cooperative caching for chip multiprocessors,” in

Proceedings of the 33rd Annual International Symposium on Computer

Architecture (ISCA), June 2006.

[17] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing replication,

communication, and capacity allocation CMPs," in Proceedings of the 32nd Annual

International Symposium on Computer Architecture (ISCA), July 2005.


[18] C. K. Chow, “Determination of cache's capacity and its matching storage

hierarchy,” IEEE Transactions on Computers, vol. c-25, pp. 157-164, 1976.

[19] W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J-H Ann, N. Jayasena,

U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck, “Merrimac: Supercomputing with

streams,” in Supercomputing (SC), November 2003.

[20] W. J. Dally and S. Lacy, "VLSI architecture: past, present, and future," in

Proceedings of the Advanced Research in VLSI conference, pp. 232—241, January.

1999.

[21] W. J. Dally, and W. Poulton, "Digital systems engineering," Cambridge University

Press, 1998.

[22] W.J. Dally and C. Seitz, "The torus routing chip," in Distributed Computing, vol. 1,

no. 3, 1986, pp. 187-196.

[23] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection

networks," in the 38th annual Design Automation Conference, 2001, pp. 684-689.

[24] G. Diamos, A. Kerr, and M. Kesavan, "Translating GPU binaries to tiered SIMD

architectures with Ocelot," technical report, Georgia Institute of Technology, GIT-

CERCS-09-01.

[25] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A dynamic

optimization framework for bulk-synchronous applications in heterogeneous

systems," in the 19th International Conference on Parallel Architectures and

Compilation Techniques (PACT), September 2010.

[26] S. Eyerman and L. Eeckhout, "Modeling critical sections in amdahl’s law and its

implications for multicore design," in Proceedings of the 37th Annual International

Symposium on Computer Architecture (ISCA), June 2010.

[27] General-Purpose Computation Using Graphics Hardware, http://www.gpgpu.org/

[28] G. Grohoski, “Niagara-2: A highly threaded server-on-a-chip,” 18th Hot Chips

Symposium, August 2006.


[29] P. Guerrier and A. Greiner. "A generic architecture for on-chip packet-switched

interconnections, " in Proceedings of Design, Automation and Test in Europe

(DATE), 2000, pp. 250-256.

[30] Z. Guz, I. Keidar, A. Kolodny, and U. C. Weiser, "Nahalal: cache organization for

chip multiprocessors", IEEE Computer Architecture Letters, vol. 6, no. 1, May

2007.

[31] Z. Guz, I. Keidar, A. Kolodny, and U. Weiser, "Utilizing shared data in chip

Multiprocessors with the Nahalal architecture", the 20th ACM Symposium on

Parallelism in Algorithms and Architectures (SPAA), special track on Hardware and

Software Techniques to Improve the Programmability of Multicore Machines, June

2008.

[32] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, “Many-

Core vs. Many-Thread machines: Stay away from the valley,” IEEE Computer

Architecture Letters, vol. 8, no. 4, April 2009.

[33] Z. Guz, O. Itzhak, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, "Threads

vs. Caches: Modeling the behavior of parallel workloads," the 28th IEEE

International Conference on Computer Design (ICCD), October 2010.

[34] Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Efficient link

capacity and QoS design for wormhole network-on-chip," in Proceedings of

Design, Automation and Test in Europe (DATE), 2006, pp. 9-14.

[35] Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “Network

delays and link capacities in application-specific Wormhole NoCs,” VLSI Design,

vol. 2007, Article ID 90941, 2007.

[36] L. Hammond, B. A. Nayfeh, and K. Olukotun "A single-chip multiprocessor," IEEE

Computer, September 1997 (Vol. 30 No. 9)

[37] M. D. Hill, and M. R. Marty, “Amdahl's law in the multicore era,” IEEE Computer,

vol. 46, July 2008.

[38] R. Ho, K. Mai, and M. Horowitz, "The future of wires," Proceedings of IEEE,

Volume 89, Issue 4, April 2001.


[39] S. Hong and H. Kim, "An analytical model for a GPU architecture with memory-

level and thread-level parallelism awareness," in Proceedings of the 3th Annual

International Symposium on Computer Architecture (ISCA), June 2009.

[40] S. Hong and H. Kim, "An integrated GPU power and performance model," in

Proceedings of the 37th Annual International Symposium on Computer Architecture

(ISCA), June 2010.

[41] E. Howard, "Garden cities of to-morrow," 1902. London: Swan Sonnenschein &

Co., Ltd.

[42] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger and S.W. Keckler, "A NUCA

substrate for flexible CMP cache sharing," International Conference on

Supercomputing (ICS) , June, 2005.

[43] F. Hurley, "Nahalal Jewish Settlement in. Palestine (1)," nla.pic-an23565289,

National Library of Australia.

[44] I.T.R. for Semiconductors. ITRS 2003 Edition. Semiconductor Industry

Association, 2003. http://public.itrs.net/Files/2003ITRS/Home2003.htm

[45] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. N. Mudge, “An analytical model

for designing memory hierarchies,” in IEEE Transactions on Computers, vol. 45,

no 10, October 1996.

[46] L. Jin and S. Cho, “Better than the two: exceeding private and shared caches via

two-dimensional page coloring”, in Workshop on Chip Multiprocessor Memory

Systems and Interconnects, February 2007.

[47] R. Kauffmann, "Planning of Jewish settlements in Palestine," Town Planning

Review 12(2):93-116, 1926.

[48] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure

for wire-delay dominated on-chip caches," in Proceedings of the 10th International

Conference on Architectural Support for Programming Languages and Operating

Systems (ASPLOS), pages 211–222, Oct. 2002.


[49] K. Kishore, R. Mukherjee, S. Rehman, P. J. Narayanan, and K.Srinathan, "A

performance prediction model for the CUDA GPGPU platform," the 16th IEEE

International Conference on High Performance Computing (HiPC), December

2009.

[50] R. Kumar, D. M Tullsen, N. P. Jouppi, and P. Ranganathan, "Heterogeneous chip

multiprocessors," in Computer, Vol. 38, No. 11, pp 32-38, November 2005.

[51] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. Farkas. "Single-

ISA heterogeneous multi-core architectures for multithreaded workload

performance," in Proceedings of the 31st Annual International Symposium on

Computer Architecture (ISCA), June 2004.

[52] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin, “Enhancing L2

organization for CMPs with a center cell,” in Proceedings of the International

Parallel and Distributed Processing Symposium (IPDPS), April 2006.

[53] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J.

Reddi, and K. Hazelwood, "Pin: building customized program analysis tools with

dynamic instrumentation," in Proceedings of the ACM SIGPLAN 2005 Conference

on Programming Language Design and Implementation (PLDI), June 2005.

[54] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, and G. Hallberg,

"Simics: A full system simulation platform," IEEE Computer, 35(2):50–58, Feb.

2002.

[55] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N. Scherer III,

and M. L. Scott, "Lowering the overhead of nonblocking software transactional

memory," in the 1st ACM SIGPLAN Workshop on Languages, Compilers, and

Hardware Support for Transactional Computing (TRANSACT), June 2006.

[56] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R.

Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general

execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Computer

Architecture News, Vol. 33, No. 4. November 2005, pp. 92-99.


[57] M. R. Marty and M. D. Hill, “Virtual hierarchies to support server consolidation,"

in Proceedings of the 34th Annual International Symposium on Computer

Architecture (ISCA), June 2007.

[58] T. Y. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguadé,

“Performance, power efficiency, and scalability of asymmetric cluster chip

multiprocessors,” In Computer Architecture Letters, Volume 4, July 2005.

[59] J. Moses, K. Aisopos, A. Jaleel, R. Iyer, R. Illikkal, D. Newell, and S. Makineni,

"CMPSched$im: evaluating OS/CMP interaction on shared cache management," in

IEEE International Symposium on Performance Analysis of Systems and Software

(ISPASS), May 2009.

[60] G. D. Micheli and L. Benini, "Networks on Chips: technology and tools," Morgan

Kaufmann, San Francisco 2006, ISBN-13:978-0-12-370521-1.

[61] NVIDIA, “CUDA Programming Guide 2.0,” June 2008.

[62] NVIDIA CUDA Zone, http://www.nvidia.com/object/cuda_home.html

[63] NVIDIA GeForce series GTX280, 8800GTX, 8800GT,

http://www.nvidia.com/geforce

[64] NVIDIA, "NVIDIA’s Next Generation CUDA Compute Architecture: Fermi,"

http://www.nvidia.com/object/fermi architecture.html

[65] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for

a single-chip multiprocessor,” in Proceedings of the 7th International Conference

on Architectural Support for Programming Languages and Operating Systems

(ASPLOS), ACM Press, New York, 1996, pp. 2-11.

[66] Opencl - the open standard for parallel programming of heterogeneous systems,

http://www.khronos.org/opencl

[67] M. S. Papamarcos and J. H. Patel, "A low-overhead coherence solution for

multiprocessors with cache memories," In Proceedings of the 11th Annual

International Symposium on Computer Architecture (ISCA), pages 348--354, June

1984.


[68] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin, "Scaling the

bandwidth wall: challenges in and avenues for CMP scaling," the 36th International

Symposium on Computer Architecture (ISCA), June 2009.

[69] R. Ricci, S. Barrus, D. Gebhardt, and R. Balasubramonian, "Leveraging bloom

filters for smart search within NUCA caches," in the 7th Workshop on Complexity-

Effective Design (WCED), June 2006.

[70] R. H. Saavedra-Barrera, and D. E. Culler, “An analytical solution for a markov

chain modeling multithreaded,” technical report, Berkeley, CA, USA, 1991.

[71] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A.

Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan

“Larrabee: a many-core x86 architecture for visual computing,” in ACM

Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

[72] D. J. Sorin, V. S. Pai, S. V. Adve, M. K. Vernon, and D. A. Wood, “Analytic

evaluation of shared-memory systems with ILP processors,” the 25th International

Symposium on Computer Architecture (ISCA), June 1998.

[73] http://www.spec.org/jbb2000/

[74] M. A. Suleman, O. Mutlu, M. Qureshi, and Y. Patt, "Accelerating critical section

execution with asymmetric multi-core architectures," in Proceedings of the 14th

International Conference on Architectural Support for Programming Languages

and Operating Systems (ASPLOS), March 2009.

[75] M. V. Wilkes, “The memory gap,” Keynote address, Workshop on Solving the

Memory Wall Problem, in conjunction with the 27th International Symposium on

Computer Architecture (ISCA), June 2000.

[76] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2

programs: characterization and methodological considerations," In Proceedings of

the 22nd Annual International Symposium on Computer Architecture (ISCA), pages

24–37, June 1995.

[77] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the

obvious,” Computer Architecture News, vol. 23, no. 1, pp. 14-24, Mar. 1995.


[78] http://www.zeus.com/products/zws/

[79] M. Zhang and K. Asanovic, "Replication: maximizing capacity while hiding wire

delay in tiled chip multiprocessors," the 32th International Symposium on

Computer Architecture (ISCA), June 2005.


השפעת יחסי הגומלין שבין

תוכניות מרובות חוטים לזיכרון

המטמון על מעבדים מרובי ליבות

צביקה גוז

The Effect of the Interplay between Multithreaded Programs and the Cache Memory on Multi-Core Processors

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Zvika Guz

Submitted to the Senate of the Technion – Israel Institute of Technology

Heshvan 5771, Haifa, October 2010

Acknowledgements

The research was carried out under the supervision of Prof. Uri Weiser, Prof. Avinoam Kolodny, and Prof. Idit Keidar, in the Faculty of Electrical Engineering.

First and foremost, I wish to thank my advisors for their dedication, help, and uncompromising guidance throughout the way. Their passion for research and their approach to life were a source of inspiration for me, and their clear thinking and immense talent a model to follow. It is hard to overstate your contribution to my personal and professional growth, and I am grateful for the opportunity I was given to learn from you.

I wish to thank Dr. Evgeny Bolotin, Prof. Ran Ginosar, Dr. Isask'har (Zigi) Walter, Oved Itzhak, Dr. Avi Mendelson, and Prof. Israel Cidon for their collaboration and for their contribution to my research.

Special thanks to Prof. Yitzhak (Tsahi) Birk, Dr. Avi Mendelson, and Ronny Ronen for their continuous support throughout the years of this research, for their constant willingness to help and to share their wisdom, and for countless pieces of good advice.

I thank the members of the MATRICS research group and of Idit's research group, who provided a fascinating and enriching research environment.

I thank the many friends I have made during my years at the Technion. Special thanks to Gal Badishi, Aran Bergman, Kirill Dyagilev, Zigi Walter, and Nadav Lavi for making this period an unforgettable one.

I owe special thanks to my family. My siblings Anat, Shachar, and Or, and especially my parents Mira and Moshe, have supported me without reservation over the years, accepted all of my choices with understanding, and have been a solid foundation throughout my life. Without your devotion and love, this research would never have seen the light of day.

I thank my wife Liat for being my partner along the way. Thank you for the help with the research and for the attentive ear, for sleepless nights and endless devotion. Thank you for being an anchor and a source of strength for me; I love you.

I thank the Technion for its generous financial support of my studies.

Abstract

Modern computer systems are based on chip multiprocessor (CMP) architectures, in which several processor cores are integrated on a single die. This architecture makes it possible to harness the inherent parallelism of multithreaded programs in order to obtain maximal performance within a given power envelope.

Two main factors drove the transition from single-core to multi-core processors. The first is the pace of progress in silicon fabrication technology: the number of transistors that can now be placed on a single die is large enough to accommodate several processors on one chip efficiently. The second is the difficulty of further improving the performance of a single core. Given the complexity of a modern processor, the power and area required by hardware mechanisms that improve single-core performance are, in most cases, too large to be worthwhile. In particular, power constraints render many such improvements unprofitable, since the added implementation complexity significantly increases overall power consumption. Integrating several simpler cores on the same die is more efficient and delivers better performance within a given power envelope.

One of the central challenges in the design of multi-core processors is the design and management of the cache. First, in multi-core processors the cores typically share a common cache, as well as the interface to external memory and to I/O devices. As a result, the pressure on the cache grows, both in capacity and in bandwidth, since multiple threads compete for the same limited memory resources. Moreover, the gaps in speed and in power consumption between on-chip accesses and accesses to external memory make off-chip accesses particularly expensive, so one must ensure that a very large fraction of the references to the memory system are served by the cache.

Second, as technology advances, on-chip wire delay keeps growing, because logic speed improves at a faster rate than wire delay. Consequently, the distance between the cache location where data resides and the core that accesses it becomes increasingly dominant in determining access time, with the difference between access times to near and far locations accumulating to tens of clock cycles. It is therefore necessary to guarantee not only that the data is found in the cache, but also that it resides close enough to the core that needs it, so that the time to obtain it is as short as possible. Together, these factors make the cache of multi-core processors a central challenge on the way to improving system performance.

Beyond considerations stemming from the hardware structure, the multithreaded execution environment has deep implications for the nature of cache accesses. As we show in this research, data sharing and communication between threads running on different cores are dominant factors in the cache access pattern, factors that do not exist at all in single-core processors. Furthermore, multithreading constitutes an alternative to caching as a means of masking accesses to external memory, and hence the combination of the two techniques (caching and multithreading) creates unique phenomena and introduces entirely new design considerations.

The combination of new access patterns with unique design considerations, together with the implications of the multithreaded environment, calls for a fresh examination of the cache and for solutions that go beyond mapping existing single-core solutions onto the multi-core world. This research addresses two such cases, in which the transition to multi-core processors creates the need for new models for evaluating system performance and for a novel cache organization.

In the first part of this research, we propose a new cache architecture for processors comprising a relatively small number of cores (multi-core). The architecture builds on the phenomenon of hot shared lines, which we discovered by analyzing the memory access patterns of several representative applications: a significant fraction of memory accesses go to data shared by a large number of cores. This data resides in a small fraction of the cache lines touched by the application, yet these lines are accessed a very large number of times. The hot shared lines phenomenon severely degrades processor performance, because it means that many cache accesses go to data located far from the core that needs it, and therefore take a long time.
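To make the phenomenon concrete, the following short Python sketch shows one way such a measurement could be carried out over a memory access trace. The trace format (pairs of core identifier and cache-line address), the sharing threshold, and the toy trace at the end are assumptions introduced purely for illustration; they do not reproduce the measurement infrastructure used in the thesis.

    # Illustrative sketch: quantify how many cache lines are shared by several
    # cores and what fraction of all accesses those lines absorb.
    from collections import defaultdict

    def shared_line_stats(trace, min_sharers=4):
        """trace: iterable of (core_id, line_address) pairs (assumed format)."""
        sharers = defaultdict(set)    # line -> set of cores that touched it
        accesses = defaultdict(int)   # line -> number of accesses to it
        for core, line in trace:
            sharers[line].add(core)
            accesses[line] += 1
        shared = {ln for ln, s in sharers.items() if len(s) >= min_sharers}
        total = sum(accesses.values())
        frac_lines = len(shared) / max(len(accesses), 1)
        frac_accesses = sum(accesses[ln] for ln in shared) / max(total, 1)
        return frac_lines, frac_accesses

    # Toy trace: four cores repeatedly touch one common line, plus a few
    # private lines each; few lines are shared, yet they draw most accesses.
    trace = [(c, 0x100) for c in range(4) for _ in range(50)]
    trace += [(c, 0x200 + c) for c in range(4) for _ in range(10)]
    print(shared_line_stats(trace))   # -> (0.2, 0.833...)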

In the Nahalal cache architecture we leverage the hot shared lines phenomenon to design a multi-core processor that attacks both the ever-growing on-chip wire delay and the shared-data access problem. This novel cache topology is named Nahalal, after the village of Nahalal in the Jezreel Valley. The proposed topology is designed in the spirit of the village's master plan, which is built of concentric circles: at the heart of Nahalal lie the public institutions (offices, the cultural hall, the school, and so on), surrounded by the family homes that form the inner circle. On the outer perimeter of the ring of houses lie the private plots of each farm, each plot spreading out like a sun ray behind the family home. This layout gives every household fast access to the public institutions, while at the same time keeping every plot of land close to its owner.

The topology proposed in this work projects the same conceptual structure onto the floorplan of the multi-core processor: a small portion of the overall cache is placed at the center of the chip, surrounded by the cores, while the rest of the cache is spread in an outer ring behind the circle of cores. The hottest shared data is concentrated in the small cache at the chip's center, thereby enabling fast access to it from every core in the system. The outer ring forms a "private yard" for each core, in which that core's private data is gathered. This arrangement preserves the proximity of private lines to their owners, while at the same time concentrating the shared lines at the center of the chip so that they can be reached quickly from all cores in the system.
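A back-of-envelope calculation illustrates why this organization can pay off. Assume, purely for the sake of illustration, that 40% of cache accesses go to hot shared lines, that a core reaches its own yard in 10 cycles and the small central region in 14 cycles, and that a line parked near some other core costs 30 cycles on average; under these assumed numbers the average access latency drops from 18 cycles to about 11.6 cycles, as the short Python fragment below works out.

    # All latency values and the shared-access fraction are assumed,
    # illustrative numbers, not measurements from the thesis.
    f_shared  = 0.4     # fraction of cache accesses that hit hot shared lines
    t_local   = 10.0    # cycles to the core's own (private-yard) bank
    t_center  = 14.0    # cycles to the small shared cache at the chip's center
    t_far_avg = 30.0    # average cycles to a line placed near some other core

    # Conventional shared cache: a shared line is near one core, far from the rest.
    avg_conventional = (1 - f_shared) * t_local + f_shared * t_far_avg
    # Nahalal: shared lines sit in the center, private lines stay local.
    avg_nahalal = (1 - f_shared) * t_local + f_shared * t_center

    print(avg_conventional, avg_nahalal)   # -> 18.0 11.6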

The Nahalal topology provides a platform for implementing dynamic cache management (DNUCA), in which data lines migrate dynamically among the storage units, with the goal of bringing the system to an ideal placement of data in the cache. Whereas previous dynamic cache management schemes failed to improve the performance of multi-core processors, the fact that the Nahalal topology allows shared blocks to reside close to all cores enables the dynamic management scheme to realize its full potential, and leads to a considerable improvement in the performance of the memory system.
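The following schematic Python fragment conveys the flavor of such a gradual-migration rule on top of the Nahalal layout. The bank names, the one-hop-per-access movement, and the simple sharing test are simplifying assumptions made for this sketch, not the exact policy evaluated in the thesis.

    # Possible positions of a line, ordered from the outer ring toward the center.
    PATH_TO_CENTER = ["yard", "mid_ring", "center"]

    def migrate(position, sharers, requester):
        """One hop per access: lines used by several cores drift toward the
        center, lines used by a single core stay in (or return to) its yard."""
        sharers = set(sharers) | {requester}
        if len(sharers) > 1:
            i = PATH_TO_CENTER.index(position)
            position = PATH_TO_CENTER[min(i + 1, len(PATH_TO_CENTER) - 1)]
        else:
            position = "yard"
        return position, sharers

    pos, sh = "yard", set()
    for core in (0, 0, 3, 5):          # private use at first, then sharing
        pos, sh = migrate(pos, sh, core)
        print(core, pos)               # yard, yard, mid_ring, center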

In the second part of this research, we address processors built of a large number of simple cores (many-core). Here we use an analytical model to study the mutual influence among caches, multithreading, and parallel applications.

Today, many-core processors targeting parallel computing and graphics processing employ two techniques to cope with the ever-growing latency of accesses to external memory. The first is a cache, whose purpose is to reduce the number of references to external memory. The second is support for a very large number of simultaneous threads, which makes it possible to cover the delay incurred by one thread's memory access by switching another thread in for execution. Different processors assign different weights to each of the two techniques, ranging from the use of only one of them to various combinations of the two. In addition, the range of applications targeted at these machines keeps broadening: beyond the traditional applications of graphics processing and classic parallel computing, new applications are added all the time, widening the range of characteristics and the variance among programs.

The combination of two different techniques for reducing the performance penalty of external memory accesses, together with a wide range of applications with different characteristics, turns system analysis into a complicated problem: predicting the performance and power consumption of a given application on a given machine is difficult, the source of the differences between machines is not always clear, and, worse still, the abundance of detail and the complexity of the machines hinder understanding of the interplay and mutual influence between application characteristics and hardware characteristics.

In this work we present a simple analytical model that makes it possible to cope with these challenges. The model represents parametrically both the hardware, as an abstract machine that supports multithreading and a cache, and the parallel application, as a multithreaded program whose locality, parallelism, and data-reuse characteristics can be set parametrically. For a given program and a given machine, the model predicts the program's performance and its power consumption. The use of an abstract machine allows an entire family of machines to be described under the same model, and hence enables a fair comparison between different machines. Thus, given the characterization of a specific program, the model can predict which machine would suit it best. Moreover, the ability to change each of the parameters, of both the application and the machine, independently of the others makes it possible to examine qualitatively the mutual effects among the various parameters, and serves as a key tool for understanding the interplay between hardware and software.
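As a rough illustration of what such a parametric model can look like, the Python sketch below ties an abstract machine (number of cores, thread contexts per core, total cache size, memory latency, and off-chip bandwidth) to an abstract application (fraction of memory instructions and a power-law miss-rate curve) and returns an aggregate throughput estimate. Every formula and every parameter value here is a simplifying assumption chosen for the example, not the calibrated model developed in the thesis.

    # A deliberately simplified analytical sketch of a multithreaded many-core
    # machine with a shared cache; all numbers below are illustrative.

    def miss_rate(cache_kb_per_thread, m0=0.02, c0_kb=256.0, alpha=0.5):
        # Assumed power-law curve: miss rate m0 when each thread gets c0_kb of
        # cache, growing as the per-thread share shrinks (capped at 1.0).
        return min(1.0, m0 * (c0_kb / max(cache_kb_per_thread, 1e-6)) ** alpha)

    def throughput_ipc(n_cores, threads_per_core, total_cache_kb,
                       cpi_exec=1.0, f_mem=0.3, t_mem=400.0,
                       line_bytes=64, mem_bw_bytes_per_cycle=32.0):
        n_threads = n_cores * threads_per_core
        m = miss_rate(total_cache_kb / n_threads)
        # Average busy period of a thread between two consecutive off-chip misses.
        t_busy = cpi_exec / (f_mem * m)
        # A core is utilized whenever at least one of its threads has work.
        utilization = min(1.0, threads_per_core * t_busy / (t_busy + t_mem))
        ipc = n_cores * utilization / cpi_exec
        # Off-chip bandwidth caps the sustainable rate of misses.
        ipc_bw_cap = mem_bw_bytes_per_cycle / (f_mem * m * line_bytes)
        return min(ipc, ipc_bw_cap)

    for t in (1, 2, 4, 8, 16, 32, 64):
        print(t, round(throughput_ipc(n_cores=64, threads_per_core=t,
                                      total_cache_kb=16 * 1024), 1))

Sweeping the number of thread contexts per core in this way makes it easy to see how two opposing effects interact: more threads shrink each thread's cache share and raise the miss rate, but also hide more of the memory latency, with off-chip bandwidth ultimately bounding the total.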

We use the analytical model to study qualitatively how the various software and hardware parameters affect the machine's performance and power consumption. We employ both synthetic programs and real benchmarks, and use simulations to validate the model. Our findings characterize families of applications that exhibit similar behavior. In addition, we show that for certain applications the performance curve is not linear and suffers from a region in which the machine delivers low performance relative to the maximum it could achieve. We study this behavior and suggest possible ways to escape it.

To summarize, the main contributions described in this research are:

• The Nahalal architecture: a cache architecture for multi-core processors that dynamically distinguishes between shared and private data and provides fast access to both.

• A simple analytical model for processors with many cores. The model enables an understanding of the interplay between modern parallel applications and the new generation of many-core processors targeting the parallel computing and graphics processing domains.

• A qualitative analysis of the effect of the various parameters, of both the software and the hardware, on the performance and power consumption of the new generation of many-core processors for parallel and graphics computing.

• The identification of a unique behavior of a family of applications, for which there exists an intermediate region in which the machine's performance is poor relative to the maximum achievable.