
Center for Embedded Computer Systems University of California, Irvine ____________________________________________________

An Embedded Hybrid-Memory-Aware Virtualization Layer for Chip-Multiprocessors

Luis Angel D. Bathen and Nikil D. Dutt

Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-2620, USA

{lbathen,dutt}@uci.edu

CECS Technical Report #12-03 March 27, 2012


An Embedded Hybrid-Memory-Aware Virtualization Layer for Chip-Multiprocessors

Luis Angel D. Bathen and Nikil D. Dutt
Center for Embedded Computer Systems, University of California, Irvine, Irvine, CA, USA

{lbathen,dutt}@uci.edu

Abstract

Hybrid on-chip memories that combine Non-Volatile Memories (NVMs) with SRAMs promise to mitigate the increasing leakage power of traditional on-chip SRAMs. We present HaVOC: a run-time memory manager that virtualizes the hybrid on-chip memory space and supports efficient sharing of distributed ScratchPad Memories (SPMs) and NVMs. HaVOC allows programmers and the compiler to partition the application's address space and generate data/code layouts considering virtualized hybrid-address spaces. We define a data volatility metric used by our hybrid-memory-aware compilation flow to generate memory allocation policies that are enforced at run-time by a filter-inspired dynamic memory algorithm. Our experimental results with a set of embedded benchmarks executing simultaneously on a Chip-Multiprocessor with hybrid NVM/SPMs show that HaVOC is able to reduce execution time and energy by 60.8% and 74.7% respectively with respect to traditional multi-tasking-based SPM allocation policies.

Categories and Subject Descriptors C.3 [Special-purpose and Application-based systems]: Real-time and embedded systems; B.3 [Design Styles]: Virtual Memory; D.4 [Storage Management]: Distributed memories

General Terms Algorithms, Design, Management, Performance

1. Introduction

The ever increasing complexity of embedded software and the adoption of open environments (e.g., Android OS) are driving the deployment of multi-core platforms with distributed on-chip memories [15, 20]. Traditional memory hierarchies are built around caches; however, caches may consume up to 50% of the processor's area and power [3]. As a result, ScratchPad Memories (SPMs) are rapidly being adopted and incorporated into multi-core platforms for their high predictability, low area, and low power consumption. Efficient SPM management can greatly reduce dynamic power consumption [16, 19, 26, 30], and SPMs may be a good alternative to caches for applications with high levels of regularity (e.g., multimedia). As sub-micron technology continues to scale, leakage power will overshadow dynamic power consumption [21]. Since SRAM-based memories consume a large portion of the die, they are a major source of leakage in the system [2], which is a major issue for multi-core platforms.

In order to reduce leakage power in SRAM-based memories, designers have proposed emerging Non-Volatile Memories (NVMs) as alternatives to SRAM for on-chip memories [17, 24, 28].

Typically, NVMs (e.g., PRAM [11]) offer high densities, low leakage power, and comparable read latencies and dynamic read power with respect to traditional embedded memories (SRAM/eDRAM). One major drawback across NVMs is the expensive write operation (high latency and dynamic energy). To mitigate the drawbacks of the write operation in NVMs, designers have made the case for deploying hybrid on-chip memory hierarchies (e.g., SRAM, MRAM, and PRAM) [28], which have shown up to 37% reduction in leakage power [14] and increased IPC as a byproduct of the higher density provided by NVMs [24, 32]. Orthogonal to traditional hybrid on-chip memory subsystems, which have predominantly focused on caches, Hu et al. [14] showed the benefits of exploiting hybrid memory subsystems consisting of SPMs and NVMs.

In this paper, we present HaVOC, a system-level hardware/software solution to efficiently manage on-chip hybrid memories and support multi-tasking Chip-Multiprocessors. HaVOC is tailored to manage hybrid on-chip memory hierarchies consisting of distributed ScratchPad Memories (SRAM) and Non-Volatile Memories (e.g., MRAMs). HaVOC allows programmers to partition their application's address space into a virtualized SRAM address space and a virtualized NVM address space through a minimalistic API. Programmers (through annotations) and compilers (through static analysis) can then specify hybrid-memory-aware allocation policies for their data/code structures at compile-time, while HaVOC dynamically enforces them and adapts to the underlying memory subsystem. The novel contributions of our work are that we:

• Explore distributed shared on-chip hybrid memories consisting of SPMs and NVMs and virtualize their address spaces to facilitate the management of their physical address spaces

• Introduce the notion of data volatility analysis to drive efficient compilation and policy generation for hybrid on-chip memories

• Present a filter-driven dynamic allocation algorithm that exploits filtering and volatility to find the best memory placement

• Present HaVOC, a run-time hybrid-memory manager that enforces compile-time-derived allocation policies through the filter-inspired dynamic allocation algorithm above

To the best of our knowledge, our work is the first to consider distributed on-chip hybrid SPM/NVM memories and their dynamic management through the use of virtualized address spaces for reduced power consumption and increased performance. Our experimental results with a set of embedded benchmarks executing simultaneously on a Chip-Multiprocessor with hybrid NVM/SPMs show that HaVOC is able to reduce execution time and energy by 60.8% and 74.7% respectively with respect to traditional multi-tasking-based SPM allocation policies.


Figure 1. HaVOC-aware Policy Generation (a) and Enforcement (b): a) User/Compiler Driven Policy Generation Flow for SPM/NVM-Enabled CMPs; b) Dynamic Placement Policy Enforcement.

2. Motivation

2.1 Comparison of Memory Technologies

Table 1. Memory Technology Comparison [32]

Features      SRAM       eDRAM   MRAM                    PRAM
Density       Low        High    High                    Very High
Speed         Very Fast  Fast    Fast read / Slow write  Slow read / Very slow write
Dyn. Power    Low        Medium  Low read / High write   Medium read / High write
Leak. Power   High       Medium  Low                     Low
Non-Volatile  No         No      Yes                     Yes

In order to overcome the growing leakage power consumption in SRAM-based memories, designers have proposed emerging Non-Volatile Memories (NVMs) as possible alternatives for on-chip memories [17, 24, 28]. As shown in Table 1 [32], NVMs (MRAM [13] and PRAM [11]) offer high densities, low leakage power, and comparable read latencies and dynamic read power with respect to traditional embedded memories (SRAM/eDRAM). One key drawback across NVMs is their high write latency and dynamic write power consumption. To mitigate the drawbacks of the write operation in NVMs, designers have made the case for deploying hybrid on-chip memory hierarchies (a combination of SRAM, MRAM, and PRAM) [14, 32], which have shown up to 37% reduction in leakage power [14] and increased IPC as a byproduct of the higher density provided by NVMs [24, 32].

2.2 Programming for SPM/NVM Hybrid Memories

Unlike caches, which have built-in HW policies, SPMs are software-controlled memories: their management is left entirely to the programmer/compiler. At first glance, we would simply need to take traditional SPM-management schemes and adapt them to manage hybrid memories. However, SPM-based allocation schemes (e.g., [16, 26, 30]) assume physical access to the memory hierarchy; consequently, the traditional SPM-based programming model would require extensive changes to account for the different characteristics of NVMs. This motivates the need for a simplified address space that minimizes changes to the SPM programming model.

2.3 Multi-tasking Support for SPM/NVM Hybrid Memories

The challenge of programming and managing SPM/NVM-based hybrid memories is aggravated by the adoption of open environments (e.g., Android OS), where users can download applications, install them, and run them on their devices. In these environments, it is possible that many of the running processes will require access to the physical SPMs; therefore, programmers/compilers can no longer assume that their applications are the only ones running on the system. Traditional SPM-sharing approaches [9, 27, 29] would either allocate part of the physical address space to each process (spatial allocation) or time-share the SPM space (temporal allocation). Once the entire SPM space has been allocated, all remaining data is then mapped to off-chip memory. In order to reduce the overheads of sharing the SPM/NVMs, our scheme exploits programmer/compiler-driven policies obtained through static analysis/annotations (Sec. 3.2) and uses this information to efficiently manage the memory resources at run-time (Sec. 3.3).

2.4 Target Platform

Figure 2. HaVOC Enhanced CMP with Hybrid NVM/SPMs.

Figure 2 shows a high-level diagram of our SPM/NVM-enhanced CMP, which consists of a number of OpenRISC-like cores, the HaVOC manager, a set of distributed SPMs and NVMs (MRAM or PRAM), and a DRAM/PRAM/Flash main memory hierarchy. For this work we focus on bus-based CMPs, so the communication infrastructure used here is an AMBA AHB bus; however, we can also exploit other types of communication fabrics.

2.5 Assumptions

We make the following assumptions: 1) The application can be statically analyzed/profiled so that code/data blocks can be mapped to SPMs [15, 16, 29]. 2) We operate over blocks of data (e.g., 1KB mini-pages). 3) We can map all code/data to on-chip/off-chip memory and do not use caches (e.g., [15]). 4) Part of off-chip memory can be locked in order to support the virtualization of the on-chip SPM/NVM memories.

3. HaVOC Overview

Figure 1 (a) shows our proposed compilation flow, which takes annotated source code and performs various types of SPM/NVM-aware static analysis (e.g., code placement, data reuse analysis, data volatility analysis); the compiler then uses this information to generate allocation policies assuming the use of virtual SPMs (vSPMs) and virtual NVMs (vNVMs). Figure 1 (b) shows our proposed dynamic policy enforcement mechanism for multi-tasking CMPs. The HaVOC manager (black box) takes the vSPM/vNVM allocation policies provided by each application (Application 1 & 2) and decides how to best utilize the underlying memory resources. The rest of this section goes over each of these components at a high level. The rest of the paper uses the terms data block and code block interchangeably, as our approach supports placement of both data and code onto the hybrid memories, and refers to SPM/NVM-based hybrid memories simply as hybrid memories.


3.1 Virtual Hybrid-Memory Space

In order to present the compiler/programmer with an abstracted view of the hybrid-memory hierarchy and minimize the complexity of our run-time system, we propose the use of virtual SPMs and virtual NVMs. We leverage Bathen's [4] concept of vSPMs, which enables a program to view and manage a set of vSPMs as if they were physical SPMs. In order to virtualize SPMs, a small part of main memory (DRAM) called protected evict memory (PEM) space was locked and used as extra storage. The run-time system would then prioritize the data mapping to SPM and PEM space based on a utilization metric. In this work we introduce the concept of virtual NVMs (vNVMs), which behave similarly to vSPMs, meaning that the run-time environment transparently allows each application to manage its own set of vNVMs. The main differences (w.r.t. [4]) are: 1) Depending on the type of memory hierarchy, vNVMs can use a unified PEM space (denoted sPEM) or off-chip NVM memory (denoted nPEM) as extra storage space. 2) The allocation cost estimation takes into account the volatility of a data block (the frequency of writes over its lifetime on a given memory), the cost of bringing in the given data, and the cost of evicting other applications' data. 3) HaVOC's dynamic policy exploits the notions of volatility and filtering (Sec. 3.5) to efficiently manage the memory real estate. Management of virtual memories is done through a small set of APIs, which send management commands to the HaVOC manager. The HaVOC manager then presents each application with intermediate physical addresses (IPAs), which point to its virtual SPMs/NVMs. Traditional SPM-based memory management requires the data layout to use physical addresses pointing to the base registers of the SPMs; as a result, the same is expected of SPM/NVM-based memory hierarchies [14]. In our scheme, all policies use virtual SPM and NVM base addresses, so any run-time re-mapping of data remains transparent to the initial allocation policies, as the IPAs do not change.

3.2 Hybrid-memory-aware Policy Generation

The run-time system needs compile-time support in order to make efficient allocation decisions at run-time. In this paper we present various ways by which designers may generate policies (manually through annotations or through static analysis). These policies are then enforced (currently in a best-effort fashion) by the run-time system in order to prioritize access to SPM/NVM space among the various applications running on the system. Each policy attempts to map data to virtual SPMs/NVMs, while the HaVOC manager dynamically maps the data to physical memories.

3.2.1 Volatility Analysis

We introduce a new metric, data volatility, to facilitate efficient loading of data onto hybrid on-chip memory configurations. Data volatility is defined over the write frequency of a piece of data across its accumulated lifetime. In order to estimate the volatility of a data block we first define a sampling time (ST_i), which can be in cycles, such that the union of all sampling times equals the block's lifetime (Eq. 1). Next, we calculate the write frequency for each sampling time (Eq. 2). Finally, we estimate the volatility of the data as the variation in its write frequency (Eq. 3). This metric is useful when deciding whether data is worth (cost-effective) being mapped onto NVM. Highly volatile data implies that at some point the cost of keeping the data in NVM during its entire lifetime might be greater than leaving it in main memory. As a result, when two competing applications request NVM space, the estimated cost function (e.g., energy savings) is used to prioritize allocation of on-chip space, while volatility can be used as a tie breaker and as a predictor of cost fluctuation. Volatility may also be used to decide the granularity at which designers do their data partitioning.

Data_{lifetime} \leftarrow \bigcup_{i=0}^{n} ST_i    (1)

Write^{i}_{freq} \leftarrow Writes_i \div ST_i, \quad i = 0 \cdots n    (2)

Data_{volatility} \leftarrow STDEV(Write^{i}_{freq}), \quad i = 0 \cdots n    (3)
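To make the metric concrete, the following is a minimal C sketch of Eqs. 1-3, assuming per-interval write counts have already been gathered by profiling; the sample_t layout and function name are illustrative and not part of HaVOC's actual interface.

```c
#include <math.h>
#include <stddef.h>

/* One profiling sample: writes observed during a sampling interval of
 * st_cycles cycles (Eq. 1 partitions the block's lifetime into these
 * intervals). Hypothetical structure for illustration only. */
typedef struct {
    unsigned long writes;    /* writes observed in this interval */
    unsigned long st_cycles; /* length of the sampling interval ST_i */
} sample_t;

/* Estimate data volatility as the standard deviation of the per-interval
 * write frequencies (Eqs. 2 and 3). */
double data_volatility(const sample_t *s, size_t n)
{
    if (n == 0)
        return 0.0;

    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += (double)s[i].writes / (double)s[i].st_cycles; /* Eq. 2 */
    mean /= (double)n;

    double var = 0.0;
    for (size_t i = 0; i < n; i++) {
        double f = (double)s[i].writes / (double)s[i].st_cycles;
        var += (f - mean) * (f - mean);
    }
    var /= (double)n;

    return sqrt(var); /* Eq. 3: STDEV of the write frequencies */
}
```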

C(D_i) = C_{load}(D^{f}_{i}) + C_{evict}(D^{t}_{i}) + C_{util}(D^{r}_{i}, D^{w}_{i}) + \Delta_{leak}(D^{r}_{i}, D^{w}_{i}, D^{lifetime}_{i})    (4)

We define the expected cost metric C(D_i) for a given data block D_i as shown in Eq. 4, which takes into account the cost of transferring the data between memory type D^{f}_{i} and memory type D^{t}_{i} (C_{load}, C_{evict}), the utilization cost (C_{util}), and the extra leakage power consumed by mapping the given data to the preferred memory type (\Delta_{leak}). The cost represents either energy or latency.
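For illustration, the components of Eq. 4 and the efficiency E(D_i) = C(D_i)/|D_i| used later by the allocator (Sec. 3.5) could be combined as sketched below; the structure and field names are placeholders rather than HaVOC's actual cost model.

```c
/* Per-block cost components of Eq. 4 (units: energy or cycles).
 * Field names are illustrative. */
typedef struct {
    double load;       /* Cload:  bringing the block into the target memory  */
    double evict;      /* Cevict: evicting whatever currently occupies it     */
    double util;       /* Cutil:  servicing the block's reads and writes      */
    double leak_delta; /* Dleak:  extra leakage over the block's lifetime     */
} block_cost_t;

static double block_cost(const block_cost_t *c)
{
    return c->load + c->evict + c->util + c->leak_delta;    /* Eq. 4 */
}

static double block_efficiency(const block_cost_t *c, double size_bytes)
{
    return block_cost(c) / size_bytes;  /* E(Di) = C(Di)/|Di|, Sec. 3.5 */
}
```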

Figure 3. Data volatility across various lifetime granularities: a) Code (block_decode calling f1-f6 over arrays a, b, c and tables zt, qt); b) Global Lifetime; c) Functional Lifetime; d) Kernel Lifetime (per-node annotations give reads, writes, and cycles).

Figure 3 (a) shows sample code from JPEG [12] and how partitioning its data's lifetime may affect the data's volatility. Figure 3 (b) shows the global lifetime of the data arrays (a, b, c, zt, qt), where the number of accesses to NVM would be (128 wr, 23K rd) for qt/zt if we map and keep them in NVM during the entire execution of the program. To accommodate other data structures in SPM space, arrays qt/zt's lifetime may be split, resulting in finer lifetime granularities (Figure 3 (c-d)). Though the read/write ratio of the data remains the same (the application still performs 23K reads and no writes to qt/zt), finer-granularity lifetimes might result in higher volatility (qt/zt now incur 23K writes to NVM since they are reloaded every time block_decode is executed), making qt/zt poor mapping candidates for NVM.

3.2.2 Annotations

Programmers can embed application-specific insights into source code through annotations [10] in order to guide the compilation process. Since we are working with a virtualized address space, programmers can create hybrid-memory-aware policies that define the placement for a given data structure by simply specifying the following parameters: <preferred memory type, reads, writes, lifespan, volatility>. These annotations are carried through the compilation process and used at run-time by the HaVOC manager, which uses the annotated information to allocate the data onto the preferred memory type.
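As an illustration only, an annotation carrying the <preferred memory type, reads, writes, lifespan, volatility> tuple might be written as a pragma on the declaration it governs; the pragma name, parameter spellings, and values below are hypothetical, not the exact notation used by our flow.

```c
/* Hypothetical annotation syntax: ask for the JPEG quantization table to be
 * placed in (virtual) NVM, and a heavily written work buffer in (virtual)
 * SPM. The estimated counts guide the HaVOC manager at run-time. */
#pragma havoc_place(mem = NVM, reads = 23040, writes = 128, lifespan = GLOBAL, volatility = LOW)
static int qt[64];

#pragma havoc_place(mem = SPM, reads = 139055, writes = 88045, lifespan = KERNEL, volatility = HIGH)
static int workbuf[256];
```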

3.2.3 Instruction Placement

Instructions and stack data are very good candidates for mapping onto on-chip NVMs since their volatility is quite low (e.g., written once and used many times). In this work, we borrow traditional SPM-based instruction-placement schemes [18] and enhance them to take into account the possibility of mapping the data to NVM memories by introducing data volatility analysis into the flow. As discussed in Sec. 3.2.1, the granularity of the code partitioning (e.g., function, basic block) affects how volatile the placement becomes. As a result, when mapping a piece of code onto vNVM/vSPM, we need to partition our code such that Eq. 5 is met, where C(D_i) represents the cost in Eq. 4. Our goal is to partition the application such that we minimize the number of code replacements ([18]) in order to minimize energy and execution time.

C(D_i)_{Off-Chip} \gg C(D_i)_{On-Chip}    (5)

3.2.4 Reuse Tree Analysis

Figure 4. Reuse Tree Examples: a) Original Code (for (x = 0; x <= 9; x++) a[x+1] = b[10*x+1];); b) Original DRDU Input; c) Original Reuse Tree; d) Hybrid-memory-aware DRDU Input (read and write references split); e) Hybrid-memory-aware Reuse Trees.

We exploit data reuse analysis (DRDU) [16] to guide the mapping of data onto the vNVM/vSPM space. We adapted Issenin's DRDU algorithm to account for differences in NVM/SPM access energy and latency. DRDU takes as input access patterns derived from C source code in affine-expression mode and produces reuse trees (Figure 4 (a-c)), which are then used to decide which data buffers to map to SPMs (typically highly reused data). Our scheme splits the access patterns into read, write, and read/modify reuse trees (Figure 4 (d)) and feeds them separately to the DRDU algorithm. We then use the generated reuse trees (Figure 4 (e)) to decide what data to map onto vNVM/vSPM space. The idea is to map highly read-reused data with a long access distance onto vNVM to minimize the number of fetches from off-chip memory, while highly reused read data with short lifetimes is preferably mapped to vSPM. Highly reused read-modify data with low write-volatility should be mapped to vNVM, while highly reused write data and read-modify data with high write-volatility should go to vSPM.

3.2.5 Memory Allocation Policy Generation

Policy generation is not limited to the schemes above; they are exemplars of how our virtualization layer can be used to seamlessly manage on-chip hybrid memories. The last step in our flow involves the generation of near-optimal hybrid-memory layouts, which we define as allocation policies. This process takes as input data/code blocks with various pre-computed costs (e.g., Eq. 4) obtained from static analysis (Sections 3.2.1, 3.2.3, and 3.2.4) and annotations (Sec. 3.2.2), which are combined and fed as input to an enhanced hybrid-memory-aware allocator based on [14].

3.3 HaVOC Manager

Figure 5. HaVOC Manager: slave and master bus interfaces, address translation layer, allocation/eviction logic, de-allocator, cost estimation logic, configuration memory (SRAM), and iS-DMA.

The HaVOC manager may be implemented in software, as an extended memory manager embedded within a hypervisor/OS, or as a hardware module (e.g., [4, 9]). The software implementation is quite flexible and does not require modifying existing platforms. The hardware version requires additional hardware and the necessary run-time support, but its run-time overheads are much lower than the software version's. In this work, we present a proof-of-concept embedded hardware implementation (Figure 2). Figure 5 shows a block diagram of the HaVOC manager. It consists of a memory-mapped slave interface, which is used by the system's masters (e.g., CPUs) and handles the read/write/configuration requests. The address translation layer module converts IPAs to physical addresses (PAs) in one cycle [4, 7]. The manager contains 1KB to 256KB of configuration memory used to keep block metadata (e.g., volatility, number of accesses). The allocation/eviction logic uses the cost estimation (e.g., efficiency) logic to prioritize access to on-chip storage. Finally, the internal DMA (iS-DMA) allows the manager to asynchronously transfer data between on-chip and off-chip memory. In order to use the HaVOC manager, the compiler generates two things: 1) the creation of the required virtual SPMs/NVMs through the use of our APIs (which produces configuration packets), and 2) the use of IPAs instead of PAs during the data/code layout stage (i.e., memory maps use purely virtual addresses for SPMs/NVMs). Any read/write to a given IPA is translated and routed to the right PA. Any write to HaVOC's configuration memory space is used to manage the virtualized address space. The goal is to allow each application to manage its virtual on-chip memories as if it had full access to the on-chip real estate.

3.4 HaVOC’s Intermediate Physical Address

Figure 6. HaVOC's Virtual Memory Metadata: a 32-bit word containing a 16-bit Physical Address Offset, an 8-bit Efficiency field, and the volatility (V) and memory-type (M) fields.

Figure 6 shows the metadata needed to keep track of the allocated blocks. Note that these fields are tunable parameters, and we leave their exploration for future work. The Physical Address Offset is used to complement the byte offset in the original IPA (Figure 7) when translating IPAs to PAs. The Efficiency and volatility (V) fields are used during the allocation/eviction phases as metrics to guide the allocator's decisions. The M field represents the preferred memory type (e.g., SPM, NVM, DRAM).
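A possible C view of this 32-bit metadata word is sketched below. Figure 6 only fixes the 16-bit Physical Address Offset and 8-bit Efficiency fields, so the 6/2-bit split between the V and M fields is an assumption.

```c
#include <stdint.h>

/* Sketch of HaVOC's 32-bit per-block metadata word (Figure 6).
 * The widths of the V and M fields are assumptions; only the 16-bit
 * physical address offset and 8-bit efficiency are given explicitly. */
typedef struct {
    uint32_t pa_offset  : 16; /* Physical Address Offset                      */
    uint32_t efficiency : 8;  /* allocation efficiency, E(Di) = C(Di)/|Di|    */
    uint32_t volatility : 6;  /* V: data volatility (assumed width)           */
    uint32_t mem_type   : 2;  /* M: preferred memory, SPM/NVM/DRAM (assumed)  */
} havoc_block_meta_t;
```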

Figure 7 shows a high-level view of the layout of one application during its execution and the address translation an IPA undergoes. Each application requiring access to a virtual SPM or virtual NVM must do so through an IPA, which means creating the desired number of vSPMs and vNVMs up front (see Sec. 3.8). For the sake of demonstration, we set the on-chip configuration SRAM to 4KB, so addressing a vSPM's block metadata only requires 12 bits. The IPA is translated and complemented by the Physical Address Offset fetched from the block metadata (Figure 7 (a)). These 16 bits combined with the 10-bit Byte Offset form the physical address of the on-chip SPM/NVM memory (Figure 7 (b)). Though we can virtualize 2^26 bytes of on-chip storage, we are bounded by the total metadata stored in configuration memory. Figure 7 (c) shows the initial mapping of Application 1, and Figure 7 (d) shows an updated mapping after Application 2 has launched. Notice that the IPA remains the same in both cases, as HaVOC updates its allocation table and the blocks' metadata. Re-mapping blocks of data/code requires the HaVOC manager to raise an interrupt which tells the CPU running Application 1 that it needs to wait for the re-mapping of its data.


Figure 7. HaVOC's Intermediate Physical Address: a) Metadata lookup in the HaVOC Configuration Memory (4KB SRAM); b) 16-bit Physical Address Generation; c) Initial Physical Address Mapping (Application 1); d) Updated Physical Address Mapping (Applications 1 and 2). The 32-bit IPA consists of a 22-bit Virtual Memory Offset and a 10-bit Byte Offset.

It is possible to avoid interrupting execution, as HaVOC has an internal DMA module and may asynchronously evict blocks. In the event that there is a request for a piece of data belonging to a block being evicted, the HaVOC manager waits until re-mapping is done, since read/write/configuration requests to HaVOC are blocking requests (i.e., HaVOC behaves like an arbiter).
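For illustration, the IPA-to-PA translation of Figure 7 (a)-(b) can be sketched as follows, assuming the 22-bit virtual memory offset / 10-bit byte offset split and a per-block metadata table held in configuration memory; all names below are illustrative.

```c
#include <stdint.h>

/* Minimal per-block metadata view needed for translation (see Figure 6). */
typedef struct { uint16_t pa_offset; } block_meta_t;

/* Block metadata table held in HaVOC's configuration memory (assumed layout). */
extern block_meta_t config_mem[];

/* Illustrative IPA -> PA translation (Figure 7 (a)-(b)): the 22-bit virtual
 * memory offset selects a metadata entry, whose 16-bit physical address
 * offset is concatenated with the 10-bit byte offset to form a 26-bit PA. */
static inline uint32_t havoc_translate(uint32_t ipa)
{
    uint32_t vm_offset   = ipa >> 10;     /* upper 22 bits of the IPA */
    uint32_t byte_offset = ipa & 0x3FFu;  /* lower 10 bits of the IPA */
    return ((uint32_t)config_mem[vm_offset].pa_offset << 10) | byte_offset;
}
```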

3.5 HaVOC’s Dynamic Policy Enforcement

Table 2. Filter Inequalities and Preferred Memory Type

Filter  Pref.   Inequalities
F1      sram    E(D_i^spm) > E(D_i^dram)  ∧  E(D_i^nvm) < E(D_i^dram)  ∧  V > T_vol
F2      nvm     E(D_i^nvm) > E(D_i^dram)  ∧  V < T_vol
F3      either  E(D_i^spm) > E(D_i^dram)  ∧  E(D_i^nvm) > E(D_i^dram)
F4      dram    E(D_i^spm) < E(D_i^dram)  ∧  E(D_i^nvm) < E(D_i^dram)

Algorithm 1 FilterDynamic Allocation Algorithm
Require: req{size, cost, volatility}
 1: pref_mem ← filter(req)
 2: if allocatable(req, pref_mem) then
 3:   return ipa ← update_alloc_table(req)
 4: end if
 5: min_set ← sortHi2LowEff(alloc_table, size)
 6: if E(min_set) < C_evict(min_set) + E(req) then
 7:   evict(min_set)
 8:   return ipa ← update_alloc_table(req)
 9: else
10:   return ipa ← mm_malloc(req)
11: end if

Figure 8. Dynamic Hybrid-memory Allocation Policies: a) Temporal; b) FixedDynamic; c) FilterDynamic; d) Preferred memories.

In order to support dynamic allocation of hybrid-memory space we define three block-based allocation policies: 1) Temporal allocation, which combines temporal SPM allocation ([29]) and hybrid-memory allocation ([14]) and adheres to the initial layout obtained through static analysis (Sec. 3.2); however, the application's SPM and NVM contents must be swapped on a context switch to avoid conflicts with other tasks (Fig. 8 (a)). 2) FixedDynamic allocation, which combines dynamic-spatial SPM allocation ([9]) and hybrid-memory allocation [14], and maps the data block to the preferred memory type (adhering to the initial layout) as long as there is space; otherwise, data is mapped to DRAM (Fig. 8 (b)). 3) FilterDynamic allocation (Alg. 1), which exploits the concepts of filtering and volatility to find the best placement. Each request is filtered according to a set of inequalities (shown in Table 2) which determine the preferred memory type (Fig. 8 (d)). The volatility of the data block (V) and its mapping efficiency (E(D_i) = C(D_i)/|D_i|) are used to determine which memory type would minimize the block's energy (or access latency). For instance, data with low volatility and high energy efficiency could potentially benefit more from being mapped to NVM than to SRAM (e.g., filter F2 in Table 2). If there is enough preferred memory space (e.g., SPM or NVM), the dynamic allocator adheres to Eq. 4 prior to loading the data. If there is not enough space, the allocator follows Alg. 1 and sorts the allocated blocks from highest to lowest efficiency (e.g., energy per bit). It then compares the cost of evicting the least important blocks (MIN_Set) with the cost of dedicating the space to the new block. If the efficiency of bringing in the new block offsets the eviction cost and efficiency of the data already mapped to the preferred memory type (E(MIN_Set)), then HaVOC evicts the MIN_Set blocks and updates the allocation table with the new block (|new block| ≤ |MIN_Set|). In the event the preferred memory type is either NVM or SPM (filter F3 in Table 2), the allocator evicts the min(MIN_Set^spm, MIN_Set^nvm). At the end, HaVOC allocates either on-chip space or off-chip space (unified PEM space (sPEM)), resulting in the allocation shown in Fig. 8 (c), where a data block originally intended for SPM is mapped to NVM.
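The filtering step of Table 2 amounts to a simple predicate over the per-memory efficiencies and the block's volatility, as in the sketch below; the enum names, argument order, and tie-breaking are assumptions, not HaVOC's exact implementation.

```c
typedef enum { MEM_SPM, MEM_NVM, MEM_EITHER, MEM_DRAM } pref_mem_t;

/* Filter of Table 2: choose a preferred memory type from the estimated
 * efficiencies E(Di) for each memory and the block's volatility V.
 * Names and evaluation order are illustrative. */
pref_mem_t havoc_filter(double e_spm, double e_nvm, double e_dram,
                        double volatility, double t_vol)
{
    if (e_spm > e_dram && e_nvm < e_dram && volatility > t_vol)
        return MEM_SPM;    /* F1: volatile data that pays off in SRAM     */
    if (e_nvm > e_dram && volatility < t_vol)
        return MEM_NVM;    /* F2: stable data that pays off in NVM        */
    if (e_spm > e_dram && e_nvm > e_dram)
        return MEM_EITHER; /* F3: both on-chip options beat DRAM          */
    return MEM_DRAM;       /* F4: neither on-chip option is worthwhile    */
}
```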


3.6 Dynamic Allocation Decision Flow

Figure 9. Dynamic Allocation Decision Flow.

Figure 9 shows the decision flow followed by our FilterDynamic allocation policy. Each allocation request for a data/code block (D) is filtered by applying a set of inequalities to derive the preferred memory type (PM) for the given block. This results in a tuple (D, PM), which is forwarded to the allocation engine. The allocator checks whether there is enough space in the preferred memory (PM); if so, it allocates the block in the preferred memory (D, PM). If there is not enough space, or the preferred memory type is either SPM or NVM, we sort the allocated blocks from highest to lowest efficiency, generating a set of least useful allocated blocks (MIN_Set^mem) whose combined size covers the requested block size. Since we are using block-based allocation, it is possible to optimize this step. For instance, if the block unit is 1KB and each requested block is 1KB, then MIN_Set^mem is composed of a single 1KB block. The backend implementation is open for exploration so that we may find the optimal/best block size, as the block size affects how much on-chip configuration memory is required to store the blocks' metadata. The next step may be done in parallel (if either memory is preferred) or synchronously (if there is not enough space in the preferred memory). In the case that either memory is acceptable, the allocator evicts the min(MIN_Set^spm, MIN_Set^nvm) and allocates the space to the new block (D). As an optimization to bypass the sorting step, it is possible to maintain a small linked list of blocks, sorted from lowest to highest efficiency, so that rather than sorting on every transaction we keep a sorted list, compute MIN_Set^mem using the head of the list, and update the list on eviction, as sketched below. Computing MIN_Set^mem would then be O(1), and updating the list O(Blocks). This of course comes at the cost of extra storage to keep the list, thereby increasing the size of the configuration memory.
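A minimal sketch of this sorted-list optimization is shown below: blocks are kept ordered by efficiency so that eviction candidates can be read off the head of the list; the node layout and helper names are assumptions.

```c
#include <stddef.h>

/* Allocated block kept in a list ordered from lowest to highest efficiency,
 * so eviction candidates (the MIN set) are simply taken from the head.
 * Layout is illustrative. */
typedef struct block {
    double        efficiency;  /* E(Di) = C(Di)/|Di| */
    size_t        size;        /* block size in bytes */
    struct block *next;
} block_t;

/* O(blocks) sorted insertion keeps the list ordered at allocation time. */
void insert_sorted(block_t **head, block_t *b)
{
    while (*head && (*head)->efficiency < b->efficiency)
        head = &(*head)->next;
    b->next = *head;
    *head = b;
}

/* Count how many blocks from the head are needed to free req_size bytes;
 * the first n blocks form the MIN set the allocator weighs against the
 * efficiency of the incoming block. */
size_t min_set_len(const block_t *head, size_t req_size)
{
    size_t gathered = 0, n = 0;
    for (const block_t *b = head; b && gathered < req_size; b = b->next) {
        gathered += b->size;
        n++;
    }
    return n;
}
```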

3.7 Application Layout

Figure 10 shows a snapshot of the resulting allocation policy (Figure 10 (c)) for the JPEG [12] benchmark. First, we start with the source code (Figure 10 (a)), which is statically analyzed. Second, the application's memory map is shown in Figure 10 (b). Note that the source code itself is executed 180 times; if it is loaded once, its volatility is very low, whereas if it is loaded every time it executes, its volatility increases. Figure 10 (c) shows the block information consisting of the virtual start/end addresses, number of accesses, volatility, and the preferred memory type (e.g., NVM).

Figure 11 shows a snapshot of a subset of blocks for the three dynamic allocation policies discussed in this paper (Oracle, FixedDynamic, FilterDynamic). The bold red lines mark differences in mappings between the policies. These allocations result from the concurrent execution of the ADPCM and AES benchmarks [12]. As we can observe, the FixedDynamic policy runs out of SPM space, and all data intended for SRAM or NVM is instead mapped to off-chip memory (DRAM). The Oracle and FilterDynamic policies have similar resulting mappings; however, there are some differences, as in this case: Buffer_0x10<25> is mapped to NVM by Oracle and to SRAM by FilterDynamic. The Buffer_0x10<25> block appears at a time when evicting the MIN_Set^nvm is more expensive than the benefit of mapping Buffer_0x10<25> to NVM; the FilterDynamic policy realizes that SRAM space is more cost-effective (at that time) and maps the block to SRAM.

3.8 Execution Timing Diagram

Figure 12. Application Execution: create vSPMs & vNVMs (fast, hundreds of cycles); load allocation policies for code and data (latency depends on system state, evictions, data size, and memory type); then load data/code via DMA and execute.

Figure 12 shows a timing diagram for the loading and execution of an application. First, prior to executing the application, we create the vSPMs/vNVMs. Second, we load the application's allocation policies, which determine how to allocate the memory space. Finally, once the space is created, the application executes as if it were talking to physical SPMs/NVMs.
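Putting the pieces together, an application's start-up might follow the sequence below; the API names (havoc_create_vspm, havoc_load_policy, and so on) are hypothetical stand-ins for HaVOC's minimalistic API, which in reality issues configuration packets to the manager's memory-mapped interface.

```c
/* Hypothetical HaVOC user API; all names are illustrative stand-ins. */
typedef struct vmem vmem_t;
typedef struct policy policy_t;
extern vmem_t *havoc_create_vspm(unsigned bytes);
extern vmem_t *havoc_create_vnvm(unsigned bytes);
extern void    havoc_load_policy(vmem_t *vm, const policy_t *p);
extern const policy_t *code_policy, *data_policy;
extern int     run_application(void);

int main(void)
{
    /* 1) Create the virtual on-chip memories (fast, hundreds of cycles). */
    vmem_t *vspm = havoc_create_vspm(32 * 1024);
    vmem_t *vnvm = havoc_create_vnvm(128 * 1024);

    /* 2) Load the compile-time allocation policies for code and data;
     *    latency depends on system state, evictions, data size, and memory type. */
    havoc_load_policy(vspm, code_policy);
    havoc_load_policy(vnvm, data_policy);

    /* 3) Execute as if talking directly to physical SPMs/NVMs: all accesses
     *    use IPAs, which HaVOC translates to physical addresses. */
    return run_application();
}
```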

4. Related Work

Most work on hybrid memories has focused on replacing or complementing main memory (DRAM) or caches (SRAM) with a combination of various NVMs to reduce leakage power and increase throughput (a byproduct of higher NVM density). Joo et al. [17] proposed PCM as an alternative to SRAM for on-chip caches. Sun et al. [28] introduced MRAM into the cache hierarchy of a NUCA-based 3D-stacked multi-core platform. Mishra et al. [24] followed up by introducing STT-RAM as an alternative MRAM memory and hid the overheads in access latencies by customizing how accesses are prioritized by the interconnect network. Wu et al. [32] presented a hybrid cache architecture consisting of SRAM (fast L1/L2 accesses), eDRAM/MRAM (slow L2 accesses), and PRAM (L3). Hu et al. [14] were the first to explore a hybrid SPM/NVM memory hierarchy and proposed a compile-time allocation algorithm that exploits the benefits of both NVMs and SPMs for a single application. Hybrid main memory has also been studied [33]. Mogul et al. [25] and Wongchaowart et al. [31] attempted to reduce write overheads in main memory by exploiting page migration and block content signatures. Ferreira et al. [8] introduced a memory management module for hybrid DRAM/PCM main memories. Static analysis has been explored to efficiently map application data onto off-chip hybrid main memories [22, 23].

HaVOC is different from approaches that address hybrid cache/main memories in that we primarily focus on hybrid SPM/NVM-based hierarchies; however, our scheme can be complemented both by existing schemes that leverage the benefits of on-chip SRAMs and NVMs and by hardware/system-level solutions that address hybrid off-chip memories (e.g., DRAM/eDRAM and PCM). Our work is different from [14] in that we consider: 1) shared distributed on-chip SPMs and 2) dynamic support for multi-tasking systems; nonetheless, our scheme can benefit from compile-time analysis schemes that provide our runtime system with allocation hints (e.g., [14, 22, 23]). Like [4], we use part of off-chip memory to virtualize on-chip memories.


Figure 10. Layout Information After Static Analysis: a) Source Code (IZigzagMatrix); b) Memory Map; c) Buffer Information and Mapping (virtual start/end addresses, reads, writes, volatility, and preferred memory type).

Figure 11. Resulting Map: per-block mappings (with reads, writes, and efficiency) under the Oracle, FixedDynamic, and FilterDynamic policies.

However, our approach differs in that our primary focus is the efficient management of hybrid on-chip memories; as a result, HaVOC's programming model and run-time environment account for the different physical characteristics of SRAMs, DRAMs, and NVMs (MRAM and PRAM). Moreover, we believe that we can complement our static-analysis/allocation-policy generation with other SPM-management techniques [9, 19, 27, 29, 30].

5. Experimental Results

Table 3. Configurations

Config.  Applications                    CPUs  vSPM/vNVM Space  SPM/NVM Space
C1       adpcm, aes                      1     32/128 KB        16/64 KB
C2       adpcm, aes, blowfish, gsm       1     64/256 KB        16/64 KB
C3       C2 & h263, jpeg, motion, sha    1     128/512 KB       16/64 KB
C4       same as C2                      2     64/256 KB        32/128 KB
C5       same as C3                      2     128/512 KB       32/128 KB
C6       same as C3                      4     128/512 KB       64/256 KB

5.1 Experimental Setup and Goals

Our goal is to show that HaVOC is able to maximize energy savings and increase application throughput in a multi-tasking environment under various scenarios. First, we generate two sets of hybrid-memory-aware allocation policies (Sec. 3.2.5): one set attempts to minimize execution time (Sec. 5.2) and the other attempts to minimize energy (Sec. 5.3). These policies are generated at compile-time and enforced at run-time by a set of dynamic allocation policies (Sec. 3.5) under various system configurations (Table 3). Next, we show the effects of the allocation policy's block size on execution time (Sec. 5.4). We built a trace-driven simulator that models a light-weight RTOS with a round-robin scheduler and context-switching enabled (time slot = 50K cycles). Our simulator models CMPs consisting of an AMBA AHB bus, OpenRISC-like in-order cores, distributed SPMs and NVMs, and the HaVOC manager (Fig. 2). We bypassed the cache and mapped all data to either SPM, NVM, or main memory (see Sec. 3.2.5). We obtained traces for the CHStone [12] embedded benchmarks using SimpleScalar [1]. We model on-chip SPMs (SRAMs), MRAMs, and PRAMs by interfacing our simulator with NVSim [6], with leakage power set as the optimization goal. To virtualize SPMs/NVMs we use the unified PEM space model discussed in Sec. 3.1 (sPEM). The HaVOC manager consists of 4KB of low-power configuration memory (SRAM).

5.2 Enforcing Performance Optimized Policies

Figure 13. Normalized Execution Time and Energy Comparison for Performance Optimized Policies (MRAM and PRAM, page size = 4KB).


For this experiment we generated allocation policies that minimize execution time for each application. We then executed each application on top of our simulated RTOS/CMP. Table 3 shows each configuration (C1-6), which has a set of applications running concurrently over a number of CPUs and a pre-defined hybrid-memory physical space. To show the benefit of our approach we implemented four policies: the three described in Sec. 3.5 (Temporal, FixedDynamic, FilterDynamic), and a policy we call Oracle (black bar in Fig. 13), which is a near-optimal policy because on every block-allocation request it feeds the entire memory map to the same policy generator the compiler uses to generate policies statically (see Sec. 3.2.5). The idea is to show that our FilterDynamic policy (backward-slashed bars in Fig. 13) achieves allocation solutions of roughly the same quality as the more complex Oracle policy. Fig. 13 shows the normalized execution time and energy for each of the configurations (C1-6, Goal = Min Execution Time, denoted G=P) using 4KB blocks and different memory types, with respect to the Temporal policy. The FixedDynamic policy (forward-slashed bars in Fig. 13) suffers the greatest impact on energy and execution time as memory space increases (C4-6), since it adheres to the decisions made at compile-time and does not efficiently allocate memory blocks at run-time. In general, we see that the FilterDynamic policy performs almost as well (in terms of energy and execution time) as the Oracle policy (within 8.45% in execution time). Compared with the Temporal policy, HaVOC's FilterDynamic policy is able to reduce execution time and energy by an average of 75.42% and 62.88% respectively when the initial application policies have been optimized for execution time minimization.

5.3 Enforcing Energy Optimized Policies

Figure 14. Normalized Execution Time and Energy Comparison for Energy Optimized Policies (MRAM and PCRAM, page size = 4KB).

Fig. 14 shows the normalized execution time and energy for each of the configurations and memory types (Goal = Min Energy, denoted G=E) with respect to the Temporal policy. As in the G=P case, both the Temporal and FixedDynamic policies are unable to efficiently manage the on-chip real estate. The FilterDynamic and Oracle policies are able to greatly reduce execution time and energy; FilterDynamic is within 3.54% of the execution time achieved by the Oracle policy. Compared with the Temporal policy, HaVOC's FilterDynamic policy is able to reduce execution time and energy by an average of 85.58% and 61.94% respectively when the initial application policies have been optimized for energy minimization. The goal of this experiment was to show that, regardless of the initial optimization goal, HaVOC's FilterDynamic policy achieves results as good as the Oracle policy.

Figure 15. Effects of Varying Block Size on Policy Enforcement and Allocation: normalized execution time (Goal = Performance) for block sizes of 1KB, 2KB, 4KB, and 8KB on MRAM and PCRAM.

5.4 Block Size Effect on Allocation Policies

So far we have seen that the Oracle policy seems like a feasible dynamic allocation solution, which could potentially enhance HaVOC's virtualization engine. However, as mentioned before, the Oracle policy is very complex, as it runs in O(Blks × spm_size × nvm_size). The FixedDynamic policy, at the other extreme, runs in O(Blks); however, its efficiency may be even worse than the Temporal policy's. HaVOC's FilterDynamic policy, on the other hand, keeps a semi-sorted list of data blocks; as a result, sorting is O(Blks) in the best case or O(Blks log Blks) in the worst case, and the final filtering decision runs in O(Blks_spm + Blks_nvm + Blks_dram), for an overall complexity of O(Blks log Blks) + O(Blks_spm + Blks_nvm + Blks_dram). Thus, the complexity and execution time of the Oracle policy increase by orders of magnitude as the number of data/code blocks to allocate (Blks) or the available resources (spm_size, nvm_size) increase. This is validated in Fig. 15, where we vary the block size (and, as a result, the number of blocks to allocate) and increase the available resources (C3-C6). As we can see, for a block size of 1KB, where the number of blocks to allocate is in the hundreds, the allocation time of the Oracle is orders of magnitude greater than that of the FilterDynamic policy. For a block size of 2KB, we see that as resources increase, the Oracle allocation time once again prevents it from being a feasible solution. On average, we observe that HaVOC's FilterDynamic is capable of achieving solutions as good as the Oracle policy (within a 10% margin) with much lower complexity. On average, across all test scenarios (varying page sizes, different optimization goals, and different configurations) we see that FilterDynamic is able to reduce execution time and energy by 60.8% and 74.7% respectively.

6. Comparison Between Oracle and FilterDynamic Run-times

Figure 16. Oracle and FilterDynamic Run Time Comparison: aggregated allocation time (×10K cycles) for varying block sizes (1KB-8KB) across configurations C1-C6.

Figure 16 shows a more detailed comparison of the aggregate run time (in cycles) of the Oracle and FilterDynamic policies. As we can see, the configuration with the most resources (C6) causes the Oracle policy to spend orders of magnitude more cycles deciding where to allocate blocks than the FilterDynamic policy. Another factor that affects the Oracle policy's decision time is the number of buffers to allocate, which is highest when the block size is 1KB.

7. Conclusion

We presented HaVOC, a hybrid-memory-aware virtualization layer for dynamic memory management of applications executing on CMP platforms with hybrid on-chip NVM/SPMs. We introduced the notion of data volatility analysis and proposed a dynamic filter-based memory allocation scheme to efficiently manage the hybrid on-chip memory space. Our experimental results for embedded benchmarks executing on a hybrid-memory-enhanced CMP show that our dynamic filter-based allocator reduces execution time and energy by 60.8% and 74.7% respectively with respect to traditional multi-tasking-aware SPM-allocation policies. Future work includes: 1) integrating HaVOC's concepts into a hypervisor to support full on-chip hybrid-memory virtualization; 2) integrating, evaluating, and enhancing other SPM-based allocation policies for hybrid memories; and 3) adding off-chip NVM memories to support the virtualization of on-chip NVMs (nPEM).

References

[1] T. Austin et al. SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2), Feb. 2002.
[2] N. Azizi et al. Low-leakage asymmetric-cell SRAM. TVLSI, Vol. 11, Aug. 2003.
[3] R. Banakar et al. Scratchpad memory: Design alternative for cache on-chip memory in embedded systems. In CODES '02, 2002.
[4] L. Bathen et al. SPMVisor: Dynamic scratchpad memory virtualization for secure, low power, and high performance distributed on-chip memories. In CODES+ISSS '11, 2011.
[6] X. Dong et al. PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM. In ICCAD '09, 2009.
[7] B. Egger et al. Dynamic scratchpad memory management for code in portable systems with an MMU. ACM TECS, 7, January 2008.
[8] A. Ferreira et al. Using PCM in next-generation embedded space applications. In RTAS '10, April 2010.
[9] P. Francesco et al. An integrated hardware/software approach for run-time scratchpad management. In DAC '04, 2004.
[10] S. Guyer et al. An annotation language for optimizing software libraries. In DSL '99, 1999.
[11] S. Hanzawa et al. A 512KB embedded PCM with 416KB/s write throughput at 100 uA cell write current. In ISSCC '07, Feb. 2007.
[12] Y. Hara et al. CHStone: A benchmark program suite for practical C-based high-level synthesis. In ISCAS '08, May 2008.
[13] M. Hosomi et al. A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM. In IEDM '05, Dec. 2005.
[14] J. Hu et al. Towards energy efficient hybrid on-chip scratch pad memory with non-volatile memory. In DATE '11, March 2011.
[15] IBM. The Cell project. http://www.research.ibm.com/cell/, 2005.
[16] I. Issenin et al. Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies. In DAC '06, 2006.
[17] Y. Joo et al. Energy- and endurance-aware design of phase change memory caches. In DATE '10, March 2010.
[18] S. Jung et al. Dynamic code mapping for limited local memory systems. In ASAP '10, July 2010.
[19] M. Kandemir et al. Dynamic management of scratch-pad memory space. In DAC '01, 2001.
[20] S. Kaneko et al. A 600MHz single-chip multiprocessor with 4.8GB/s internal shared pipelined bus and 512KB internal memory. In ISSCC '03, 2003.
[21] N. Kim et al. Leakage current: Moore's law meets static power. Computer, 36(12), Dec. 2003.
[22] K. Lee et al. Application specific non-volatile primary memory for embedded systems. In CODES+ISSS '08, 2008.
[23] T. Liu et al. Power-aware variable partitioning for DSPs with hybrid PRAM and DRAM main memory. In DAC '11, June 2011.
[24] A. Mishra et al. Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs. In ISCA '11, 2011.
[25] J. Mogul et al. Operating system support for NVM+DRAM hybrid main memory. In HotOS '09, 2009.
[26] P. Panda et al. Efficient utilization of scratch-pad memory in embedded processor applications. In EDTC '97, 1997.
[27] V. Suhendra et al. Scratchpad allocation for concurrent embedded software. In CODES+ISSS '08, 2008.
[28] G. Sun et al. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA '09, Feb. 2009.
[29] H. Takase et al. Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems. In DATE '10, 2010.
[30] M. Verma et al. Data partitioning for maximal scratchpad usage. In ASP-DAC '03, 2003.
[31] B. Wongchaowart et al. A content-aware block placement algorithm for reducing PRAM storage bit writes. In MSST '10, 2010.
[32] X. Wu et al. Hybrid cache architecture with disparate memory technologies. In ISCA '09, 2009.
[33] P. Zhou et al. A durable and energy efficient main memory using phase change memory technology. In ISCA '09, 2009.