
Heterogeneous Architecture Design with Emerging 3D and Non-Volatile Memory Technologies

Qiaosha Zou∗, Matthew Poremba∗, Rui He†, Wei Yang†, Junfeng Zhao†, Yuan Xie‡
∗Computer Science and Engineering, The Pennsylvania State University, USA
†Huawei Shannon Lab, China
‡Electrical and Computer Engineering, University of California, Santa Barbara, USA
Email: ∗{qszou, mrp5060}@cse.psu.edu, †{ray.herui,william.yangwei,junfeng.zhao}@huawei.com, ‡[email protected]

Abstract—Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that exponentially more cores can be packed into a single chip every two years; however, the increasing power density is the obstacle to continuous performance gains. Recent studies show that the heterogeneous multi-core is a promising solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses illustrate the scalability of the heterogeneous system and its potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies such as CMOS logic and emerging non-volatile memory can be integrated on the same chip, enabling a new paradigm of architecture design.¹

¹ Zou, Poremba, and Xie were supported in part by NSF grants 1218867, 1213052, 1409798, and 1017277, and by SRC grants.

I. INTRODUCTION

As Moore's law predicted, the number of transistors doubles every 18 months. Performance, however, does not scale exponentially, because it is restricted by the mismatched scaling speeds of power consumption and memory bandwidth. Furthermore, in today's computing systems, energy efficiency has become the primary concern during system design. The traditional scale-out strategy of packing more cores into a single chip is no longer power sustainable: the concept of dark silicon emerged as roughly a 2× power shortfall occurs when powering a chip at its native frequency [1]. Therefore, the utilization rate of cores in the same area drops exponentially across generations. Fortunately, the single-chip heterogeneous multi-core is one potential solution for balancing power consumption against performance enhancement. A heterogeneous multi-core combines conventional processors with emerging computing fabrics such as GPGPUs, as well as emerging memories such as NVMs.

Meanwhile, the slowly growing pin count poses another challenge for both homogeneous and heterogeneous systems. Compared to the exponential growth rate of transistors, pin counts increase only linearly, resulting in scarce bandwidth resources. Moreover, because of the variety of applications and the blooming of media and streaming processing, bandwidth is a crucial factor for performance and throughput. Expanding chips in the third dimension, with three-dimensional integrated circuits (3D ICs), can alleviate the pin count limitation with low-latency, high-bandwidth vertical connections. Because each die is processed in isolation, 3D ICs further accelerate the adoption of heterogeneous systems in a cost-efficient fashion. Figure 1 sketches a future 3D computing system in which layers of CPUs, GPUs, and accelerators are stacked, and on-chip hybrid memories are placed beside the computing fabrics on an interposer [2, 3]. TSVs (through-silicon vias) are used as the vertical connections between tiers, providing high-bandwidth, low-latency interconnects.

Fig. 1. Overview of future 3D computing system combining CPUs, GPUs, accelerators, on-chip memory, and off-chip hybrid memory.

In the 3D heterogeneous system, digital and analog fabrics can be integrated in a single chip while each component is optimized separately, with its own optimal clock frequency, supply voltage, and even technology node. In the digital part, several layers of CPUs are built, containing multiple cores with diverse computing capabilities and pipeline depths for power efficiency. Tiers of GPUs are stacked, providing massive processing elements for high parallelism. Numerous accelerators dedicated to target applications are designed on demand and integrated at low cost. Moreover, on-chip memory is stacked to satisfy the increasing memory bandwidth demand through short-latency TSV arrays. Off-chip hybrid memory moves into closer proximity to the processing chip thanks to the interposer. The CPUs and GPUs connect with off-chip memory through a dedicated memory interface and high-bandwidth on-chip interconnects.

In this paper, three major heterogeneous integration strategies in the multi-core field are summarized. The first is the single-ISA heterogeneous multi-core architecture in Section II. Specifically, we focus on the integration of cores from different technology nodes utilizing 3D stacking; this kind of heterogeneity can maximize performance within a given cost and power budget. The second is the currently most popular heterogeneous-ISA architecture, which integrates conventional CPUs with unconventional processing elements (GPUs, FPGAs, and accelerators). We mainly focus on the integration of CPUs and GPUs in Section III. Both latency-aware and throughput-aware applications can benefit from this integration, a strategy advocated by numerous industrial products such as AMD Fusion, Intel Sandy Bridge, and NVIDIA Tegra [4, 5, 6]. The last, in Section IV, is the combination of memories built from different materials, namely, the integration of traditional DRAM/SRAM technology and emerging non-volatile memory (NVM). By leveraging the low standby power of NVMs and the short access latency of traditional memories, memory bandwidth can be guaranteed with affordable power consumption.

II. SINGLE-ISA HETEROGENEOUS MULTI-CORE ARCHITECTURES

Today's applications show enormous diversity in their demands for computing resources. Furthermore, the requirements vary across execution phases even within the same program. Therefore, a uniform multi-core architecture providing general-purpose computing capability is over-provisioned, stressing power management. Instead of packing in ever more powerful cores, designers are seeking solutions that use cores with various computing capabilities to provide just enough service. The most efficient design, without involving much re-design effort, is the single-ISA (instruction set architecture) multi-core architecture, which keeps the same ISA across all cores and varies only their computing resources.

The prevalent single-ISA architecture design combines high-performance cores (big cores) and low-power cores (small cores). Big cores offer higher performance to handle compute-intensive and latency-sensitive applications at the cost of higher power consumption. Small cores are simple, low-power processors (e.g., in-order processors) with lower throughput; however, they are more energy efficient for memory-intensive applications. The performance and power modeling of this multi-core architecture was first conducted in an exploration of Alpha cores [7]. The researchers examined the energy savings and performance degradation of SPEC benchmarks on an architecture containing four cores: two in-order small cores and two out-of-order big cores. The results show that a 39% average energy reduction is achieved with only 3% performance degradation. The energy saving is even better than that of dynamic voltage/frequency scaling.

The idea of integrating big and small cores has been promoted by industry since ARM announced its heterogeneous multi-core, named big.LITTLE [8]. In their design, the Cortex-A15, a triple-issue out-of-order processor, works as the big core, while the Cortex-A7, an in-order, non-symmetric dual-issue processor, is the little core. In general, the Cortex-A15 delivers 2-3× higher performance than the A7; however, the A7 is about 3-4× more energy efficient due to the different pipeline lengths. Because of the varied energy and performance characteristics in such a heterogeneous system, scheduling each application onto the appropriate core is crucial. Moreover, since application requirements change across execution phases, dynamic application mapping and migration are necessary. Recently, substantial research efforts have been made to accurately model application performance on different cores [9] and to conduct application mapping/scheduling and migration [10, 11, 12].

An analytical performance model has been used to study the fundamental design tradeoffs of single-ISA heterogeneous multi-cores [9]. In addition to whole-system throughput, the focus of previous work, per-program performance is also considered as a design metric. By determining the frontier of Pareto-optimal architectures, the study finds that no single heterogeneous configuration can balance per-program performance and system throughput: fundamentally, single-program performance is traded for system throughput. Moreover, the effectiveness of heterogeneity depends heavily on job mapping.

In general, task scheduling can be performed statically or dynamically. Application characteristics can be extracted before execution and used to determine the suitable core statically [13]. For dynamic application mapping, characteristics of tasks and processors, such as power, utilization, and bandwidth requirements, are monitored, and the appropriate core is then selected for better speedup or power efficiency. For instance, the power-performance balance can guide task-to-core mapping using price theory [10]. Since application characteristics vary over time, a predictive trace-based controller can predict the upcoming phase and migrate execution accordingly [11]. Processor utilization affects system performance and energy; a utilization-based load balancer demonstrated in Linux reduces energy by up to 11.35% [12].
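
To illustrate the flavor of such dynamic policies, the following toy Python sketch maps tasks to big or little cores using a simple utilization threshold. The threshold value and the task statistics are illustrative assumptions, not parameters taken from the cited work.

    # Toy sketch of utilization-threshold scheduling on a big.LITTLE-style system.
    # The 0.6/0.5 cutoffs and the task numbers below are illustrative only.
    BIG_THRESHOLD = 0.6  # assumed CPU-utilization cutoff for "big" placement

    def map_task(cpu_utilization, memory_intensity):
        """Pick a core type from monitored task characteristics."""
        # Compute-bound, latency-sensitive work goes to a big core;
        # memory-bound work runs more energy efficiently on a little core.
        if cpu_utilization > BIG_THRESHOLD and memory_intensity < 0.5:
            return "big"
        return "little"

    tasks = {"encoder": (0.9, 0.2), "log-scanner": (0.4, 0.8)}
    for name, (util, mem) in tasks.items():
        print(name, "->", map_task(util, mem))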

In addition to the basic integration of big and small cores, heterogeneous multi-core design across different technology nodes is another possible route to a power-efficient architecture. This kind of design can also be cost efficient, since cost can be reduced by integrating components from earlier, mature fabrication processes. As shown in Figure 2, at the beginning of a new technology node the cost is extremely high compared to older technologies. The new technology nevertheless guarantees integration density and transistor speed, driving the industry to adopt it. On the other hand, beyond cost, the ever-growing power density and leakage power at smaller feature sizes make previous technologies appealing. Therefore, instead of building all cores in the same technology, some cores can use older technologies, exploiting the price gap to integrate more cores. For example, if the transistor cost in technology node N is four times that in node N−1, we can integrate up to four cores in node N−1 at the same cost as a single core in node N.

For cores in different technology generations, the clock frequency and power supply become the major challenges, as the intrinsic threshold voltage of a technology determines the minimum supply voltage and the corresponding maximum clock frequency.


Fig. 2. The normalized transistor cost of different technology nodes (40nm, 28nm, and 20nm), quarterly from 1Q10 through 3Q14 [14].

Therefore, building cores with different technology nodes on a single die requires sophisticated power and clock design. Fortunately, emerging 3D integration can solve this problem by constructing the cores on different dies [15]. Each die can have its own power domain and clock network without interfering with the others, with the help of cross-power-domain interfaces. Moreover, the fabrication process can be simplified because each die is processed separately. Therefore, designers can focus on their own designs, and different companies can cooperate easily.

III. HETEROGENEOUS ARCHITECTURES CONTAINING CPUS AND GPUS

Energy efficiency is forcing microprocessor designers to exploit new architectures with sustainable performance per watt. The previous section illustrated the single-ISA multi-core architecture. However, single-ISA heterogeneity still has a limitation: all the cores are only good at handling latency-aware applications. Recent research has therefore found that combining the advantages of CPUs and GPUs can sustain performance improvement for both latency-aware and throughput-aware applications in an energy-efficient way [16]. The CPU is powerful in sequential computing for lower latency, while the GPU provides higher throughput with massively parallel data processing.

The general-purpose GPU was first introduced for throughput-oriented graphics applications because it contains a large number of processing elements. These elements operate in parallel, applying simple instructions to massive amounts of data of any type. However, the GPU cannot substitute for the CPU, because the operating frequency of GPUs is much lower than that of CPUs. CPUs can handle complex instructions at higher frequency, and latency is their highest priority. For example, the Intel Core i7 CPU clock rate is about 3GHz, while the frequency of a recent GPU is 1300MHz [17, 18]. Moreover, even though the GPU provides higher throughput, applications with little opportunity for parallelism see no significant performance gain.

Due to their distinct bandwidth requirements, CPUs and GPUs usually use different memories. When combining CPUs and GPUs for cooperation, data sharing and movement become the bottleneck for performance improvement. For a single application with both serial and parallel sections, data exchanges between CPUs and GPUs are necessary: the data needed by the GPUs must be copied from the CPUs via an interconnect, for example PCIe. However, the bandwidth of PCIe (about 20GB/s [19]) is dramatically smaller than the bandwidth of GPU memory (more than 200GB/s), resulting in large data sharing latency.
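
A quick back-of-the-envelope calculation, sketched in Python below, illustrates the gap. The 1 GB working-set size is an assumption for illustration; the bandwidth figures are those quoted above.

    # Illustrative arithmetic: moving an assumed 1 GB working set.
    data_gb = 1.0
    pcie_bw_gbs = 20.0      # PCIe bandwidth quoted in the text (GB/s)
    gpu_mem_bw_gbs = 200.0  # GPU memory bandwidth quoted in the text (GB/s)
    print("PCIe copy:       %.0f ms" % (data_gb / pcie_bw_gbs * 1e3))     # ~50 ms
    print("GPU memory read: %.0f ms" % (data_gb / gpu_mem_bw_gbs * 1e3))  # ~5 ms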

Fortunately, integrated CPUs and GPUs with shared memory have been proposed. Extensive data duplication between CPU and GPU is eliminated thanks to the support of a unified memory address space. Nevertheless, the introduction of a unified memory space burdens the shared memory bandwidth and memory request scheduling. The latency-sensitive nature of CPUs means low tolerance for memory latency. Although GPUs place few demands on memory latency, being designed to hide long latency through thread-level parallelism, they occupy high memory bandwidth for relatively long periods. Integration thus costs each side bandwidth through sharing, and performance is degraded.

There are two directions for solving the shared bandwidth problem. The first is to devise a high-bandwidth memory providing adequate resources for both CPUs and GPUs. 3D stacked memory is one competitive solution, as demonstrated by the Hybrid Memory Cube [20] and High Bandwidth Memory [21]. According to their specifications, HMC can provide up to 320GB/s of bandwidth, while HBM is dedicated to graphics applications with 256GB/s of bandwidth.

The other direction is to intelligently schedule memory requests for fair and efficient bandwidth sharing [22, 23, 24]. Bandwidth pressure can be reduced at the main memory level or the cache level. Staged memory scheduling decouples the memory controller's task into three stages: the first stage groups requests based on row-buffer locality, the second stage schedules the inter-application requests, and the last stage handles DRAM commands and timing to perform the final data operations [22]. Unlike previous studies that focus on multi-application scenarios, memory scheduling for a single parallel application is dramatically different; various optimization techniques have been proposed to enhance row-buffer locality, improving overall throughput by up to 8% [24]. The last-level cache is effective at hiding memory latency only when multi-threading in GPUs fails to do so, so increasing the cache hit rate for GPGPUs is not always necessary. Consequently, a core sampling mechanism is proposed to exploit the symmetric behavior of GPGPU applications, along with two new thread-level-parallelism-aware cache management strategies [23]. The bandwidth available to CPUs can be even worse when coherence requests consume bandwidth. Traditional directory-based coherence designs are hard to apply directly to the heterogeneous scenario. A region directory can replace the traditional directory, with region buffers added to both CPU and GPU L2 caches to track permission regions. By moving coherent requests onto the incoherent direct-access bus, the bandwidth to the directory is reduced by an average of 94% [25].
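
As a rough illustration of the staged idea (not the actual mechanism of [22]), the sketch below batches pending requests by open DRAM row before issuing them. The request fields and the locality-first policy are simplified assumptions.

    # Simplified sketch of a staged, row-buffer-aware request scheduler.
    # This illustrates the batching idea only; it is not the design of [22].
    from collections import defaultdict, deque

    class StagedScheduler:
        def __init__(self):
            self.batches = defaultdict(deque)  # stage 1: group by (bank, row)

        def enqueue(self, app_id, bank, row, addr):
            self.batches[(bank, row)].append((app_id, addr))

        def schedule(self):
            # Stage 2: serve the largest batch (most row-buffer locality).
            if not self.batches:
                return []
            key = max(self.batches, key=lambda k: len(self.batches[k]))
            batch = self.batches.pop(key)
            # Stage 3 would translate the batch into ACT/RD/WR DRAM commands.
            return [("RD", key[0], key[1], addr) for _, addr in batch]

    s = StagedScheduler()
    s.enqueue("cpu0", bank=0, row=12, addr=0x100)
    s.enqueue("gpu0", bank=0, row=12, addr=0x140)
    s.enqueue("cpu1", bank=1, row=7, addr=0x200)
    print(s.schedule())  # serves the row-12 batch first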


In addition to bandwidth allocation, another challenging task is how to execute tasks efficiently on the system, including the programming model and task scheduling from both software and hardware perspectives [26, 27, 28, 29]. In the heterogeneous platform, the ISA and functionality of GPUs are fundamentally different from those of general-purpose CPUs. Therefore, both legacy and newly developed applications must be tailored, with great effort, to fit the new architecture, given that developers may be unfamiliar with its features. OpenCL [26] has emerged as the first open standard for cross-platform, parallel programming of modern architectures. Beyond the open standard, several studies adapt the basic programming model to their own target systems. An integrated C/C++ programming environment supporting specialized cores has been proposed for heterogeneous-ISA MIMD architectures [27]. An OpenCL-based framework, SnuCL, makes OpenCL applications portable across compute devices in heterogeneous clusters [29].

Performance modeling is also an interesting topic for heterogeneous systems, as it can predict scalability at an early design stage. One simple analytical model that can be applied is Amdahl's Law, and previous studies have extended it into the heterogeneous computing era [30, 31]. Measurements and predictions suggest that future computers will integrate various unconventional computing units (GPGPUs, FPGAs, and ASICs) [30]. The same study also shows that sufficient parallelism is the prerequisite for significant performance gains from heterogeneous computing, and that bandwidth is a first-order concern in developing an efficient system.

Two processing modes are available for CPU and GPU integration: asymmetric and simultaneous asymmetric. The first mode divides a program into three segments: serial execution on one CPU, parallel execution on the multi-cores, and parallel execution on the GPUs. The second mode schedules different programs onto CPUs and GPUs, which compute simultaneously. Amdahl's Law has been revised to capture the system configuration leading to the optimal speedup [31]. That study assumes a structure in which CPU and GPU share the same memory space; we extend the work to take the data sharing delay into consideration and compare the speedups of unified and separate memory spaces.

Similar to the definitions in previous work [31], the total numbers of CPUs and GPUs are denoted by c and g. However, due to the power constraint, which is measured in units of a single CPU's power consumption, the number of GPUs is limited to (PB − c)/wg, where PB is the power budget and wg is the power ratio of a GPU to a CPU. Assume the portion of the program that can be parallelized is f, and the portion of the parallel program running on the multi-cores is α. β is the execution timing ratio of GPU to CPU, and γ is the data sharing latency normalized to the program's computation time on a single CPU. For memory-intensive applications, the value of γ can be larger than 1. Note that we assume the memory latency of data sharing between CPUs is negligible. The theoretical asymmetric speedup is represented as follows:

\[ \text{Speedup} = \frac{1}{(1-f) + \dfrac{\alpha f}{c} + \gamma + \dfrac{(1-\alpha)f}{g\beta}} \tag{1} \]
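
A small Python sketch of Equation (1) under the stated power constraint follows. The power budget PB = 100 CPU-power-units is our assumption, chosen so that the CPU count can sweep from 1 to 99 as in Figure 3; the remaining parameter values are those listed in the Figure 3 caption.

    # Sketch: sweep the CPU count under Eq. (1) with a fixed power budget.
    # PB = 100 CPU-power-units is an assumption consistent with sweeping
    # c from 1 to 99; f, alpha, beta, w_g match the Figure 3 caption.
    PB = 100.0
    f, alpha, beta, w_g = 0.7, 0.5, 0.5, 0.25

    def speedup(c, gamma):
        g = (PB - c) / w_g  # GPUs affordable after powering c CPUs
        if g <= 0:
            return 0.0
        return 1.0 / ((1 - f) + alpha * f / c + gamma
                      + (1 - alpha) * f / (g * beta))

    for gamma in (0.0, 0.1, 0.5, 1.0, 2.5):
        best_c = max(range(1, 100), key=lambda c: speedup(c, gamma))
        print("gamma=%.1f: best c=%d, speedup=%.2f"
              % (gamma, best_c, speedup(best_c, gamma)))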

Figure 3 shows the speedups for a program running on the CPU+GPU system. We vary the CPU count from 1 to 99 and examine the speedup when γ equals 0, 0.1, 0.5, 1.0, and 2.5. The curves indicate that, under the power constraint, the speedup first increases with more CPUs; after a turning point, integrating more CPUs degrades performance because of the reduced parallel components (one CPU's power consumption equals that of four GPUs). Furthermore, when the data sharing delay is sufficiently large, CPU+GPU integration yields no performance benefit, which indicates the significant advantage of a unified memory space.

Fig. 3. Speedups considering the data sharing delay, for γ = 0, 0.1, 0.5, 1.0, and 2.5. The x-axis (CPU count) is in logarithmic scale. f = 0.7, α = 0.5, β = 0.5, and wg = 0.25.

IV. HETEROGENEOUS MEMORY ARCHITECTURES

The memory and storage system is a critical component of many computer systems, and more and more applications are shifting from compute-bound to data-bound. A hierarchy of memory and storage components is used to efficiently store and manipulate large amounts of data. However, the performance and energy scaling of mainstream memory technologies cannot keep up with the requirements of current computing systems. Commodity memory technologies such as SRAM and DRAM face scalability challenges due to the limitations of their cell size and power dissipation. In particular, the leakage power of SRAM and DRAM and the refresh power of DRAM are increasing, contributing a significant portion of overall system energy [32]. Therefore, an energy-efficient memory subsystem is urgently needed to continue system improvement.

Recently, emerging byte-addressable non-volatile memory technologies have been studied as replacements for traditional memories due to their promising characteristics: higher density, lower leakage power, and non-volatility. Representative technologies include spin-transfer torque memory (STT-RAM), phase-change memory (PCM), and resistive memory (ReRAM) [33, 34, 35]. Nevertheless, several shortcomings of NVMs, such as high write latency/power and low endurance, impede their direct adoption as full replacements. Heterogeneous integration of SRAM/DRAM with NVMs is one potential solution to these design challenges.

Most NVM technologies are not compatible with the CMOS process traditionally used to implement digital logic. Consequently, for most types of NVM, a silicon interposer or 3D stacking is leveraged for implementation [36]. Layers of different NVM technologies and of traditional SRAM/DRAM can be integrated, and the resulting 3D stacked memory can increase memory capacity and bandwidth in a cost- and energy-efficient fashion. For instance, Sun et al. [33] proposed a 3D cache architecture with STT-RAM as the L2 cache; thanks to the non-volatility of STT-RAM, the study demonstrated a 70% reduction in system power with moderately improved performance. CMPs are vulnerable to soft errors, which can be eliminated by stacking all levels of the memory hierarchy with STT-RAM, improving system performance by 14.5% with a 13.44% power reduction [37]. The non-volatility and soft-error resistance of NVM also make it attractive for FPGAs: a 3D PCM-based FPGA exhibits advantages over 3D FPGAs built with traditional memory technologies in terms of power consumption, wirelength, and critical path delay [38].

Main memory needs to be sufficiently large to hold most of an application's data. Commodity computer systems use DRAM as main memory, which may not be sustainable given the limited scalability of DRAM. Among the various NVMs, PCM is believed to be the best candidate for main memory. However, using PCM alone as main memory raises endurance and power problems, as indicated by Lee et al. [34], whose study shows that a pure PCM-based main memory can be 1.6× slower and consume 2.2× more energy than a DRAM-based memory due to PCM's high write latency and energy. In a DRAM and PCM hybrid memory, DRAM can work as a buffer to serve memory writes with low latency, or it can be placed in parallel with PCM to share a portion of the memory requests. Qureshi et al. [39] propose a main memory design with a PCM region and a small DRAM buffer; their study shows that a 3× speedup can be achieved by combining the short latency of DRAM with the high capacity of PCM. Ramos et al. [40] study the energy-delay² (ED²) of a hybrid system in which pages migrate between DRAM and PCM based on monitored access patterns; across simulations of 27 workloads (from the SPEC, SPEC2006, and Stream suites), their system is more robust and achieves lower ED².

Lately, big data applications have emerged as a mainstream workload in datacenter and warehouse-scale computing. Their enormous memory footprints and energy consumption make DRAM+PCM hybrid memory even more attractive. In this section, we preliminarily explore the latency and energy of the two hybrid memory styles (DRAM as buffer, or DRAM as part of main memory) under big data applications. We extract the memory traces of six applications (aggregation, join, kmeans, pagerank, select, terasort) from a big data benchmark suite [41] and two traditional applications (volrend, radiosity) from SPLASH-2 [42]. The traces are then replayed using the non-volatile memory simulator NVMain [43] for latency and power estimation. The total off-chip memory capacity is kept constant at 4GB, and the memory frequency is 800MHz. When the system contains both DRAM and PCM as main memory, the DRAM capacity is 1GB and the PCM capacity is 3GB. When DRAM is used as a cache, its capacity is 32MB.
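
The evaluation flow can be pictured with the minimal trace-replay sketch below. It is a stand-in for the NVMain-based methodology, not NVMain's actual interface, and the trace format, address split, and latency/energy constants are all illustrative assumptions.

    # Minimal stand-in for trace replay (not NVMain's real interface).
    # Trace lines are assumed to look like: "<cycle> <R|W> <hex address>".
    # The (latency in cycles, energy in nJ) constants are illustrative.
    DRAM = {"R": (30, 1.0), "W": (30, 1.2)}
    PCM = {"R": (90, 2.0), "W": (350, 15.0)}
    DRAM_LIMIT = 1 << 30  # assumed split: first 1 GB -> DRAM, rest -> PCM

    def replay(trace_lines):
        total_latency = total_energy = n = 0
        for line in trace_lines:
            _, op, addr = line.split()
            tech = DRAM if int(addr, 16) < DRAM_LIMIT else PCM
            lat, enj = tech[op]
            total_latency += lat
            total_energy += enj
            n += 1
        return total_latency / n, total_energy  # avg cycles, total nJ

    print(replay(["100 R 0x1000", "120 W 0x5F000000", "150 W 0x2000"]))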

Figure 4 shows the average latency and power consumption of four memory configurations: pure DRAM, pure PCM, DRAM+PCM, and DRAM as cache. Due to PCM's long write latency, pure PCM has the highest latency of the four, while pure DRAM obviously has the lowest. The latency of the DRAM cache is better in three applications (pagerank, volrend, radiosity), especially the traditional applications, due to their relatively higher cache hit rates. The highest latency for volrend occurs in the DRAM+PCM case because most of its memory accesses go to PCM, as the power results also suggest. On the power side, the low DRAM-cache hit rate of the big data benchmarks leads to high power consumption. Moreover, because the metadata reside in DRAM, the DRAM cache consumes almost as much power as the pure DRAM case. The power consumption of DRAM+PCM is very close to that of pure PCM because of the relatively small DRAM capacity.

Fig. 4. Comparison of the average latency (in clock cycles) and power consumption (in watts) across the eight benchmarks for four memory configurations: pure DRAM, pure PCM, DRAM+PCM (parallel), and DRAM cache. (a) Average latency. (b) Power consumption.

As the results illustrate, there is no single winner. However, we can apply optimizations to the hybrid memory design to mitigate the long latency of PCM and the high energy of DRAM. For example, hierarchical metadata placement can reduce the number of DRAM accesses in an energy-efficient way [44]. We can also develop intelligent data placement to balance the workload between DRAM and PCM [45] and increase the cache hit rate.
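
One simple form such intelligent placement could take is a hot-page promotion policy. The sketch below, with assumed thresholds and counters, only illustrates the idea discussed above; it is not the mechanism of [45].

    # Toy hot-page promotion policy for a DRAM+PCM hybrid (illustrative only).
    HOT_THRESHOLD = 8  # assumed accesses per epoch before promotion to DRAM

    class HybridPlacer:
        def __init__(self, dram_pages):
            self.dram_free = dram_pages
            self.counts = {}      # page -> accesses this epoch
            self.in_dram = set()

        def access(self, page):
            self.counts[page] = self.counts.get(page, 0) + 1
            if (page not in self.in_dram
                    and self.counts[page] >= HOT_THRESHOLD
                    and self.dram_free > 0):
                self.in_dram.add(page)  # migrate hot page PCM -> DRAM
                self.dram_free -= 1
            return "DRAM" if page in self.in_dram else "PCM"

    p = HybridPlacer(dram_pages=1)
    print([p.access(0xA) for _ in range(9)])  # 8th access promotes page 0xA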

V. CONCLUSION

The heterogeneous multi-core architecture promises power-sustainable performance scaling in future computer systems. In addition to traditional CPUs and memories, emerging computing fabrics and non-volatile memories are engaged to improve performance per watt. In this paper, three typical heterogeneous systems were introduced: the single-ISA multi-core, the integrated CPU and GPU architecture, and the hybrid memory system. Moreover, heterogeneous systems can leverage 3D integration to further expand the design space. Despite the promising features of heterogeneity, several key challenges must be tackled, such as task mapping/scheduling and application-specific design optimization.

REFERENCES

[1] M. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse," in Design Automation Conference, 2012.
[2] Y. Koizumi, N. Miura, E. Sasaki, Y. Take, H. Matsutani, T. Kuroda, H. Amano, R. Sakamoto, M. Namiki, K. Usami, M. Kondo, and H. Nakamura, "A scalable 3D heterogeneous multi-core processor with inductive-coupling ThruChip interface," IEEE Micro, vol. 33, pp. 6-15, 2013.
[3] S. Borkar, "3D integration for energy efficient system design," in Design Automation Conference, 2011.
[4] A. Branover, D. Foley, and M. Steinman, "AMD Fusion APU: Llano," IEEE Micro, vol. 32, pp. 28-37, 2012.
[5] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32nm processor," in IEEE International Solid-State Circuits Conference, 2011.
[6] "NVIDIA Tegra," http://www.nvidia.com/object/white-papers.html.
[7] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in International Symposium on Computer Architecture, 2004.
[8] "ARM big.LITTLE technology," http://www.arm.com/products/processors/technologies/biglittleprocessing.php.
[9] K. Van Craeynest and L. Eeckhout, "Understanding fundamental design choices in single-ISA heterogeneous multicore architectures," ACM Trans. Archit. Code Optim., vol. 9, pp. 1-32, 2013.
[10] T. Somu Muthukaruppan, A. Pathania, and T. Mitra, "Price theory based power management for heterogeneous multi-cores," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[11] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, "Trace based phase prediction for tightly-coupled heterogeneous cores," in International Symposium on Microarchitecture, 2013.
[12] M. Kim, K. Kim, J. R. Geraci, and S. Hong, "Utilization-aware load balancing for the energy efficient operation of the big.LITTLE processor," in Design, Automation and Test in Europe, 2014.
[13] J. Chen and L. John, "Efficient program scheduling for heterogeneous multi-core processors," in Design Automation Conference, 2009.
[14] "Cost scaling trend," http://www.extremetech.com/computing/123529-nvidia-deeply-unhappy-with-tsmc-claims-22nm-essentially-worthless.
[15] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, "Design space exploration for 3D architectures," Journal of Emerging Technologies in Computing Systems, vol. 2, pp. 65-103, 2006.
[16] "AMD heterogeneous system architecture," http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/.
[17] "Intel Core i7," http://ark.intel.com/products/37148/.
[18] "NVIDIA GeForce GTX 980," http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications.
[19] "PCI Express 3.0," https://www.pcisig.com/specifications/pciexpress/base3/.
[20] J. Jeddeloh and B. Keeth, "Hybrid memory cube new DRAM architecture increases density and performance," in Symposium on VLSI Technology, 2012.
[21] JEDEC HBM, http://www.jedec.org/category/technology-focus-area/3d-ics-0.
[22] R. Ausavarungnirun, K.-W. Chang, L. Subramanian, G. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in International Symposium on Computer Architecture, 2012.
[23] J. Lee and H. Kim, "TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture," in International Symposium on High Performance Computer Architecture, 2012.
[24] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, "Memory scheduling towards high-throughput cooperative heterogeneous computing," in International Conference on Parallel Architectures and Compilation, 2014.
[25] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in IEEE/ACM International Symposium on Microarchitecture, 2013.
[26] "OpenCL API," https://www.khronos.org/opencl/.
[27] P. H. Wang, J. D. Collins, G. N. Chinya, H. Jiang, X. Tian, M. Girkar, N. Y. Yang, G.-Y. Lueh, and H. Wang, "EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system," in Conference on Programming Language Design and Implementation, 2007.
[28] A. Kerr, G. Diamos, and S. Yalamanchili, "Modeling GPU-CPU workloads and systems," in Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[29] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, "SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters," in International Conference on Supercomputing, 2012.
[30] E. Chung, P. Milder, J. Hoe, and K. Mai, "Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs?" in International Symposium on Microarchitecture, 2010.
[31] A. Marowka, "Extending Amdahl's law for heterogeneous computing," in International Symposium on Parallel and Distributed Processing with Applications, 2012.
[32] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. Keller, "Energy management for commercial servers," Computer, vol. 36, no. 12, pp. 39-48, Dec 2003.
[33] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in International Symposium on High Performance Computer Architecture, 2009, pp. 239-249.
[34] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2-13.
[35] C. Xu, X. Dong, N. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in Design, Automation and Test in Europe Conference, 2011.
[36] Y. Xie, "Future memory and interconnect technologies," in Design, Automation and Test in Europe Conference and Exhibition, 2013.
[37] G. Sun, E. Kursun, J. Rivers, and Y. Xie, "Exploring the vulnerability of CMPs to soft errors with 3D stacked non-volatile memory," in International Conference on Computer Design, 2011.
[38] Y. Chen, J. Zhao, and Y. Xie, "3D-NonFAR: Three-dimensional non-volatile FPGA architecture using phase change memory," in International Symposium on Low Power Electronics and Design, 2010.
[39] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in International Symposium on Computer Architecture, 2009, pp. 24-33.
[40] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in International Conference on Supercomputing, 2011.
[41] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in International Symposium on High Performance Computer Architecture, 2014.
[42] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in International Symposium on Computer Architecture, 1995.
[43] M. Poremba and Y. Xie, "NVMain: An architectural-level main memory simulator for emerging non-volatile memories," in IEEE Computer Society Annual Symposium on VLSI, 2012.
[44] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, "Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management," Computer Architecture Letters, vol. 11, pp. 61-64, 2012.
[45] M. Pavlovic, N. Puzovic, and A. Ramirez, "Data placement in HPC architectures with heterogeneous off-chip memory," in International Conference on Computer Design, 2013.
