
A High-Performance and Energy-Efficient Virtually Tagged Stack Cache Architecture for Multi-Core Environments

Suk chan Kang∗, Chrysostomos Nicopoulos†, Hyunggyu Lee∗ and Jongman Kim∗

∗School of Electrical and Computer Engineering, Georgia Institute of Technology, Email: {skang37, hyunggyu, jkim}@gatech.edu

†Department of Electrical and Computer Engineering, University of Cyprus, Email: [email protected]

Abstract—Virtually tagged caches possess a key attribute that renders them more attractive than Physically Indexed, Physically Tagged (PIPT) caches; they operate natively within the virtual address space, hence taking full advantage of the original intention of a virtual memory implementation: the illusion of a contiguous address space. Consequently, virtually tagged caches eliminate Translation Look-aside Buffer (TLB) references, yielding both energy and performance improvements. On the other hand, virtually tagged caches incur substantial overhead in resolving homonym/synonym issues, which is a fairly complicated process in contemporary multi-core environments. In this paper, we aim to markedly alleviate this overhead through the use of a new virtually tagged stack cache design specifically targeting multi-core environments. It will be demonstrated that special level-one virtually tagged stack caches can significantly boost the performance of a system running a heavy, dominating, multi-threaded workload – among other applications – while actually reducing its energy consumption. This scheme is aimed at modern server environments that run a single, dedicated multi-threaded application workload per server. The proposed virtually tagged stack cache for multi-core processors minimizes the overhead incurred in resolving virtual-tag-related artifacts by granting exclusive access to only one multi-threaded workload at a time. In other words, said virtually tagged cache filters the Virtual Address (VA) spaces and subsequently handles only the stack areas of the selected virtual address space. A cost-effective way to implement the proposed stack cache for multi-core systems is also presented, yielding average performance improvements of around 20%.

Keywords-Stack Cache, Non-Uniform Cache Access (NUCA), Multi-Threaded, Multi-Core, Fine-Grain Thread Parallelism, CR3 Register Filtering, TLB Power Consumption

I. INTRODUCTION

The memory hierarchy has been a mainstay of modern computer systems. The seemingly interminable divergence in the performance of memory and logic components (i.e., the widening memory-logic gap) has cemented the memory hierarchy as one of the principal components of all modern microprocessors. Atop this hierarchy sit the various cache levels, which are typically implemented on the same silicon die as the Central Processing Unit (CPU) itself. In general, higher-level caches target short access latencies. They achieve this by limiting both the cache size and associativity. On the contrary, lower-level caches are designed primarily to reduce access misses (possibly at the expense of access latency). An intelligently designed small-size helper cache can overcome some of the inherent shortcomings of level-one (L1) caches. Victim caches are one example of such small helper caches that can provide additional associativity. The L1 cache design trade-off between capacity and access latency implies longer access times with larger cache sizes. For instance, a 64 KB L1 cache requires a two-to-three-cycle hit access time [1]. Even smaller L1 caches can exhibit relatively long access times; for example, the 32 KB L1 data cache of the latest Intel micro-architecture (codename Sandy Bridge) has a 4-cycle access time [2].

Many contemporary applications are already optimized for such large L1 cache sizes. For instance, most popular computer architecture benchmarks exhibit very high hit ratios in a 64 KB L1 data cache. Accordingly, any performance improvement in these large L1 caches is naturally upper-bounded by that small room for hit-ratio improvement. Alternatively, one can focus on reducing the access latency. In fact, "small-but-fast" caches located at the same level as the large, "slow" L1 data cache can precipitate a non-uniform data cache access effect with shorter average latency than the L1 cache alone. However, naively designed "small-but-fast" caches will simply cause an avalanche of "cache thrashing" events, from L1 all the way down to the memory subsystem. Therefore, only cache accesses with high temporal/spatial locality and with a small footprint are to be directed to the "small-but-fast" cache.

From this perspective, the VA stack area is a great candidate for a "small-but-fast" L1 helper cache, as proven by previous papers [3], [4]. The stack area in VA space stores temporary data, such as local variables, register values, arguments, and return addresses of function frames, and pushes/pops all context components (register values) at a context switch. Processes created using high-level languages like C commence growing this run-time stack from the "main" function.

In general, stack-area memory accesses play an instrumental role in overall system performance. Research in this domain has shown that real workloads tend to access the VA stack area at a high frequency, albeit within a small memory footprint [3], [5], because a process/thread is always running within a function during its execution and, usually, function frames do not tend to be very deep. It is precisely this observation that has led to various designs that handle memory accesses to the stack area at a higher level in the memory hierarchy, in order to improve performance and reduce energy consumption. While such mechanisms can take the form of either a large register file or a small cache [3], [5], the recent proliferation of the multi-core archetype tends to favor the latter solution, namely the stack cache. Modern Chip Multi-Processors (CMP) implicitly expect multi-threaded workloads that require individual run-time stacks for each sub-thread. The presence of fine-grained parallelism can cause frequent context switching among the various sub-threads. Hence, in such a setting, a stack cache is likely to prove more beneficial than a register file, because each sub-thread's run-time stack can remain cached with no additional overhead (in flushing register values to caches), even in the presence of frequent scheduling changes [6], [7].

In summary, a "small-but-fast" stack cache makes sense because:

• VA stack accesses inherently exhibit high spatial/temporal locality within a small footprint (and cache thrashing is avoided)

• It can avoid the relatively long cache access times of large L1 caches

In addition to the "small-but-fast" trait, a stack cache can reap further performance/energy benefits if designed as a Virtually Indexed, Virtually Tagged (VIVT) cache. Most of the additional benefits emanate predominantly from the elimination of address translation, which results in lower overhead, lower energy consumption, and higher performance.

Despite the numerous advantages of VIVT caches, they are primarily afflicted by the excessive overhead incurred by synonym/homonym issues [8], [9]. While these pathological phenomena are accentuated in multi-core environments, they are present even in uni-core systems. A synonym effect occurs when different virtual addresses map to the same physical address, giving rise to coherency problems (when these different virtual addresses are cached simultaneously, but separately). Conversely, a homonym arises when the same virtual address maps to several different physical addresses (e.g., across different address spaces). Homonyms are tackled by tagging process identification numbers, or by flushing the virtual caches at every VA-space context switch. Note that the flushing method also resolves the synonym issues. Other ways to handle synonyms involve the operating system restricting page allocation, albeit at the cost of increased page fault rates, or a hardware mechanism detecting synonyms dynamically, while tolerating their existence [8], [10]. To further compound the complexity of handling these phenomena, synonym/homonym issues are even more prevalent in CMPs, where various multi-threaded workloads are meant to be executed concurrently. Consequently, the above countermeasures become more complex and incur significant overhead in terms of both performance and energy consumption [9]. Therefore, the advantageous properties of VIVT stack caches are outweighed by the excessive overhead associated with the handling of these problematic instances.

The aim and fundamental driver of this work is to develop a stack cache architecture that can extract all the advantages of a VIVT setup, while minimizing the negative side effects. Toward this end, we hereby propose a modified VIVT stack cache that manually filters VA spaces so as to select only one main VA space (process), whose run-time stack areas are to be cached. Such a design is especially amenable to multi-core microprocessors running (among other applications) one heavy, dominating multi-threaded workload.

Building on this premise, the main contributions of this work are:

• A stack cache implementation for multi-core systems is proposed that reaps all benefits of a virtually tagged cache, while virtually eliminating all the overhead associated with said caches. This characteristic is of paramount importance in modern multi-core settings, where virtual-tag-related pathologies exacerbate the required overhead. The resulting fast (but small) stack cache is shown to considerably increase overall system performance.

• A cost-effective way to implement such a low-cost virtually tagged stack cache for CMPs is presented, whereby the VA spaces of all running processes are manually filtered in favor of the dominating multi-threaded workload. The efficacy and efficiency of this low-cost heuristic is demonstrated through extensive simulations, which indicate average performance improvements of around 20%, even when using a smaller aggregate L1 data cache. Additionally, a 45.5% reduction in TLB accesses is also observed, which strongly suggests a lowering of energy consumption.

The rest of this paper is organized as follows: Section II provides a concise preamble on virtual address space-sharing techniques in shared-memory architectures, and then proceeds with a discussion of related prior work in the VA stack domain. Section III delves into the description, implementation, and analysis of the proposed virtually tagged stack cache, while Section IV presents the employed evaluation framework, the various experiments, and the accompanying analysis. Finally, Section V concludes the paper.

II. BACKGROUND

A. The basics of virtual address space-sharing in shared-memory architectures

There are two prevalent methodologies that are typically utilized in shared-memory architectures:

1) The Multi-threaded methodology (abstractly visualized in Figure 1a): It operates on the principle of one main thread and its N sub-threads residing in a single VA space and comprising one process. The main thread creates its sub-threads using a "clone()" system call. Consequently, the main thread and its sub-threads share the page table, global variable area, and file descriptors. The principle of confinement within a single VA space aims to alleviate the heavy overhead of context switching during parallel execution. However, the main thread and all sub-threads must have their own separate resources, such as stack areas (user-level/kernel-level) and registers. User-level thread libraries like POSIX Threads [11] and OpenMP [12] are based on this concept. In the x86 architecture, the main thread and its sub-threads use the same "cr3" register to refer to the page table of the single shared VA space. Note that this "cr3"-register-sharing activity of a multi-threaded process can easily be monitored by a full-system simulator like Simics [13] (as employed in our evaluation framework).
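To make the single-VA-space property concrete, the following minimal C sketch (illustrative only, not one of the evaluated workloads) creates a few POSIX sub-threads and prints the address of a stack-resident local variable in each. Under a Linux/pthreads implementation, all threads see the same global variable, yet each local-variable address falls within a different, non-overlapping per-thread stack region of the shared VA space.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Shared global: one copy, visible to all threads in the single VA space. */
    static int shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int local = (int)(long)arg;   /* lives on this thread's private run-time stack */

        pthread_mutex_lock(&lock);
        shared_counter += local;
        /* Same VA space (same page table / cr3), but a distinct stack area per thread. */
        printf("thread %d: &local = %p, &shared_counter = %p\n",
               local, (void *)&local, (void *)&shared_counter);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)(long)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);

        printf("main: shared_counter = %d\n", shared_counter);
        return 0;
    }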

2) The Shared-memory inter-process communication (IPC) methodology (abstractly visualized in Figure 1b): It follows a different philosophy, which does not strive to share as many resources as possible. Instead, this type imposes strict VA space protection among processes, while permitting one shared area, aptly called the "shared memory segment". This shared memory segment is the only shared resource. A programmer may utilize this methodology by creating N processes, each with its own VA space. The processes interact through the OS-granted shared memory segment. The N processes are not aware of each other's existence, because they are conventionally created by a "fork()" system call from the shell. However, they can communicate with each other through the shared memory segment. The "POSIX shared memory segment" is an example of this methodology. One downside is that this programming style usually causes synonym problems when using VIVT caches (one can easily verify this phenomenon by simply observing the process pointers to the shared memory segment): while the pointers held by the different processes will have different values, they still physically point to the same shared-memory-segment location.
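As an illustration of how such a synonym arises, the hedged C sketch below (a minimal example constructed for this discussion, with error checking omitted; on older glibc, link with -lrt) has a parent and a forked child map the same POSIX shared memory object. The two mappings generally land at different virtual addresses even though they refer to the same physical pages, which is exactly the situation a VIVT cache without synonym handling cannot tolerate.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        /* Create a POSIX shared memory segment (the name is arbitrary for this sketch). */
        int fd = shm_open("/synonym_demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, len);

        if (fork() == 0) {                       /* child: its own VA space */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            printf("child : segment mapped at %p\n", (void *)p);
            strcpy(p, "written by child");
            _exit(0);
        }

        wait(NULL);                              /* parent: a different VA space */
        char *q = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        printf("parent: segment mapped at %p, contents: \"%s\"\n", (void *)q, q);

        shm_unlink("/synonym_demo");             /* clean up the named segment */
        return 0;
    }

In a virtually tagged cache, the child's and parent's differently tagged lines for these same physical bytes could coexist and diverge; the proposed filtering sidesteps this hazard by admitting only one VA space's stack area.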

The virtually tagged stack cache architecture proposed in this paper is targeted at the former style (i.e., multi-threaded; Figure 1a) of address space-sharing.

(a) Multi-threaded (b) Shared memory IPC

Figure 1: The two prevalent virtual address space-sharing types

B. Related work

The portion of memory devoted to the VA stack area (i.e., the place reserved mostly for local variables) has attracted a lot of attention from researchers over the last several years. This interest stems from the special characteristics of this memory area, which lends itself nicely to various optimizations. More specifically, the VA stack area exhibits high temporal and spatial locality within a fairly limited memory footprint.

Mamidipaka et al. [14] proposed a separate stack-based memory organization for embedded systems, in order to store the callee-saved registers and return addresses of function calls. The separate stack structure achieved both improved system performance and reduced system power/energy.

Instead of using a single, multi-ported data cache, Cho et al. [3] advocated the decoupling of the stack-area data stream in wide-issue superscalar processors. The decoupled stream was then directed to a queue and cache storage specifically reserved for local variables.

In the same vein, Huang et al. [4] improved the energy-delay product of the microprocessor by employing a specialized stack cache and a pseudo-set-associative cache for L1 data.

Lee et al. [5] proposed a non-cache stack-area storage structure, the so-called "stack value file". This is a big register file architected as a circular buffer and designed to improve instruction-level parallelism, reduce stack access latencies, reduce the demand on the first-level cache, and reduce data bus traffic. The authors of [5] also analyzed some important properties of the stack area. For example, the Top-of-Stack (ToS) decides which area in the stack segment is meaningful and which is garbage. By using this property, they were able to reduce unnecessary fetching and write-backs of stack values.

Finally, Woo et al. [10] reduced the synonym lookup energy in a virtual cache by using Bloom filters. This is crucial, since synonym issues should be resolved in order to take full advantage of virtual caches.

Despite the plethora of prior research in this area, all of these papers assumed a uni-core processor environment, and the pathological synonym/homonym issues of virtually tagged caches were not addressed in most of these techniques. It is clear that the ongoing transition into the multi-/many-core era has given a new flavor to stack caches. The existence of multiple cores accentuates the synonym/homonym issues that afflict virtually tagged caches. Given the various advantages of said caches, it becomes imperative to address these potential show-stoppers in earnest. This is precisely the contribution and major differentiator of our work: a mechanism that allows the use of virtually tagged stack caches in multi-core settings. The proposed technique facilitates a straightforward and lightweight incorporation of stack caches in real server environments.


III. A VIRTUALLY TAGGED STACK CACHE FOR MULTI-CORE MICROPROCESSORS

A. Manual virtual address space filtering

The stack cache constitutes the quintessential embodiment of a "small-but-fast" cache that can satisfy the stack area's requirement of high-frequency accesses within a confined memory footprint. Moreover, owing to its small size, the stack cache can elevate system performance with minimal power overhead. Huang et al. [4] have validated this assertion by demonstrating the advantages of stack caches using a combined performance-power product metric.

As previously mentioned, if a stack cache is designed as a VIVT cache, it can offer significant advantages over a physically tagged equivalent. Most of the additional benefits emanate predominantly from the elimination of address translation:

• Larger allowed size for the "small-but-fast" cache: Physically tagged caches require address translation on every memory access. Therefore, the address translation path latency from the TLB to the cache would inevitably lower the maximum size that would still allow for a single-cycle hit. On the contrary, by removing the TLB access, a VIVT design enables larger cache sizes that are still able to achieve a single-cycle delay.

• Lower energy consumption: Typically, the TLB has large (or full) associativity. This high associativity comes at the cost of energy-expensive tag-matching operations. Instead, a VIVT cache can skip this process, as well as the further overhead associated with a TLB miss.

• Easy stack cache manipulation: As previously mentioned, the Top-of-Stack determines which areas in the VA stack contain meaningful data and which ones contain garbage. The work in [4] adopts a VIVT stack cache for easy manipulation of the ToS, since a VIVT cache need not be concerned with the non-contiguous nature of physical addresses. (A minimal sketch contrasting the PIPT and VIVT lookup paths follows this list.)
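To make the translation-elimination argument concrete, the following C-style sketch contrasts the two lookup paths. It is a simplified illustration under assumed structures (the stubbed tlb_lookup(), page_walk(), and cache_probe() helpers are hypothetical, not a model of the simulated hardware): the PIPT path must consult the TLB before the tag comparison, whereas the VIVT path probes the cache directly with the virtual address.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical, simplified helpers for illustration only. */
    typedef struct { bool hit; } probe_result;

    /* Stub TLB: identity-maps the address and reports a hit. A real TLB is a
     * highly associative, energy-expensive structure probed on every access. */
    static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) { *paddr = vaddr; return true; }

    /* Stub page walk, standing in for the extra penalty of a TLB miss. */
    static uint32_t page_walk(uint32_t vaddr) { return vaddr; }

    /* Stub cache probe: tag/index comparison on whatever address it is given. */
    static probe_result cache_probe(uint32_t addr) { (void)addr; return (probe_result){ .hit = true }; }

    /* PIPT path: every access pays for address translation before the tag check. */
    static probe_result pipt_access(uint32_t vaddr)
    {
        uint32_t paddr;
        if (!tlb_lookup(vaddr, &paddr))
            paddr = page_walk(vaddr);   /* additional latency/energy on a TLB miss */
        return cache_probe(paddr);      /* physical tag and index */
    }

    /* VIVT path: the virtual address supplies both tag and index, so the TLB is
     * consulted only on a miss or a dirty write-back (not shown here). */
    static probe_result vivt_access(uint32_t vaddr)
    {
        return cache_probe(vaddr);
    }

    int main(void)
    {
        printf("PIPT hit: %d, VIVT hit: %d\n",
               pipt_access(0xbfff0000u).hit, vivt_access(0xbfff0000u).hit);
        return 0;
    }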

However, it is imperative to address the pathological scenarios of synonyms and homonyms, which incur undue overhead in typical VIVT caches. In an effort to extract all the inherent benefits of virtually tagged caches, while minimizing the adverse side effects, we introduce a modification to the stack cache's modus operandi, such that it manually filters out all VA spaces except one main VA space (process). Thus, the run-time stack areas of this lone process become the exclusive occupants of the stack cache. This design naturally favors modern server environments that run one heavy, dominating multi-threaded workload.

Figure 2: Non-overlapping, separate run-time stacks for main/sub-threads in one VA space, as maintained in the "single-VA-space" multi-threaded programming style.

The proposed VA space filtering approach within the stack caches results in the following important attributes:

• No synonyms: The run-time stacks of the single selected VA space are not shared among the main/sub-threads of the VA, as illustrated in Figure 2. Note that, even within one VA space, synonyms can still be deliberately created by the programmer, as described in [8], [9]. In other words, synonyms may be defined on purpose by the programmer for convenience, but, in such cases, the responsibility lies with the programmer and, in fact, this kind of programming style is neither customary nor advisable. In our experiments, we assume that such a programming style is not followed.

• No homonyms: Since other VA spaces (processes) are filtered out of the stack cache, they do not interfere with the selected, active VA space.

• Lower design complexity: Free from synonyms/homonyms, our VIVT stack cache can operate (and maintain coherence) with similarly low complexity as a physically tagged cache, by using only the virtual addresses of the selected VA space for both tags and indexes. This property reduces both access latencies and energy consumption.

• Reduced stack cache thrashing: Since the stack cache is very small, it cannot – by construction – contain the stack values of the other, non-performance-dominating processes.

The crux of this technique is the exploitation of the dominating nature of the one, heavy, multi-threaded application. As dictated by Amdahl's Law, this application mostly determines the overall system performance and is, thus, protected by filtering out the remaining applications and preventing them from using the proposed stack cache. The experimental evaluation in Section IV corroborates this claim by reporting average performance improvements of around 20%, even when using a smaller aggregate L1 data cache.

B. Cache coherence requirements for the VA stack area

Cache coherence dictates that all cache lines shared by multiple cores have the same valid value at a given moment. Memory consistency dictates that all memory write operations are seen by all cores in the same order. We hereby briefly recap the fundamental cache coherence requirements under an invalidation-based cache coherence mechanism.

1) The general case with multiple threads: Figure 3a describes the general case where a coherence protocol is needed. Multiple threads are running on all CPU cores. In the particular example illustrated in the figure, three cache lines are shared at a specific time. The simplest way to keep "coherence" is just to use "valid" and "invalid" states for each cache line. However, with only those two states, an invalidation signal must be issued on every memory write operation. Given that these invalidation signals cause the caches of all CPU cores to stall, one must try to limit the frequency of the signals as much as possible. Conventional cache coherence protocols achieve this by further dividing the "valid" state into, say, "Modified (M)", "Owner (O)", "Exclusive (E)", and "Shared (S)" states. All shared-memory architectures must provide a cache coherence mechanism for the general case(s).

2) Self-coherence case with only one thread: Figure 3b is an interesting special case where non-shared cache lines belonging to only one process/thread also need a coherence mechanism, due to CPU-core migration. The variable "a" in the figure is not shared with any other thread. However, a coherence protocol must (a) transfer variable "a" when the thread moves from one CPU core to another and (b) update/invalidate the value(s). This migration can happen if "CPU-affinity" scheduling is not set and the multi-core CPU scheduling mechanism works in a "work-conserving" way. Even though pure VA stack-area memory accesses are not shared by other threads/processes, a stack cache still requires a coherence mechanism for "self-coherence" effects (assuming the CPU scheduling permits migration).

(a) The general case with multiple threads   (b) Self-coherence case with only one thread

Figure 3: Cache coherence requirements under an invalidation-based cache coherence mechanism

C. The VA run-time stack frame in the x86 architecture

Listing 1 explains the x86 function frame by juxtaposing a C source code snippet and its assembly code equivalent, as generated with "objdump -d". An x86 stack frame shrinks upwards and grows downwards, and is defined by two pointer registers, "ebp" and "esp": "ebp" is the frame pointer and "esp" is the Top-of-Stack pointer. The stack-area variables are referenced by their relative position from either "ebp" or "esp". The most important values, such as function arguments, the return address, and local variables, are referenced via "ebp", because "esp" can vary after a function frame is created.

In the x86 architecture, stack-area memory accesses are identified by the range between the ToS and the origin position of the run-time stack, i.e., [ToS, Origin of a run-time stack], as shown in Figure 4b. The ToS is directly obtained from the %esp value, while the origin of a run-time stack is determined by the maximum "ebp" values of the run-time stacks for user and kernel, respectively. The method used to obtain these maximum values is given in Listing 2 in Verilog-style pseudo-code. The code works in an updating manner, triggered whenever the values "current_dpl", "max_ebp_user", "max_ebp_kernel", and "current_ebp" change. The value "current_dpl" is the descriptor privilege level of the run-time stack, and it decides whether the current run-time stack is for user or kernel. If it is set to "3", the run-time stack is in "ring 3", which implies user-level privilege. A value of "0" implies "ring 0", i.e., kernel-level privilege. The resulting "max_ebp_user" and "max_ebp_kernel" at a given moment specify the origin of the run-time stack of the selected VA for the user and kernel stacks, respectively.

Figure 2 illustrates an important property of the "multi-threaded" methodology of VA space sharing (as described in Section II-A); namely, the divided run-time stack size effect. All the main and sub-threads must have their own individual run-time stack area for both user/kernel levels, in order to avoid a stack-area crash. Those run-time stacks are needed for user function calls, system calls, and context switches. A programmer should be careful with the maximum stack-area sizes of the main and sub-threads of applications employing a "multi-threaded" methodology of VA space sharing, especially when creating many sub-threads.

Multi-threaded workloads use "CPU-affinity" scheduling to leverage cache locality by preventing expensive cache-to-cache transfers caused by CPU-core migration. If the "CPU-affinity" attribute is set for our target multi-threaded workload, the above-mentioned VA stack property (i.e., divided, individual run-time stack areas for the sub-threads) actually eliminates the necessity for a cache coherence mechanism in the proposed VIVT stack cache. In other words, CPU-core migration no longer requires explicit "self-coherence", since this phenomenon cannot occur.
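As an aside, pinning each sub-thread to a core, which is what the "CPU-affinity" assumption above amounts to, can be expressed in a few lines of Linux-specific C. The sketch below is illustrative only and uses the GNU pthread_setaffinity_np() extension; it is not part of the evaluated workloads.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single CPU core so that its cached
     * run-time stack never has to migrate between private caches. */
    static int pin_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = (int)(long)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "could not pin thread to core %d\n", core);
        /* ... the sub-thread's actual work would run here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&tid[i], NULL, worker, (void *)(long)i);
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }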


Listing 1: An x86 function frame (file format elf32-i386)

// C source of foo():
//   int foo(int a, int b) { int sum; sum = a + b; return sum; }
<foo>:
  push %ebp                     // store caller %ebp on the stack
  mov  %esp,%ebp                // update callee %ebp
  sub  $0x10,%esp               // lower callee ToS by 0x10
  mov  0xc(%ebp),%eax           // "b" referenced by 0xc(%ebp)
  add  0x8(%ebp),%eax           // "a" referenced by 0x8(%ebp)
  mov  %eax,0xfffffffc(%ebp)    // "sum" referenced by 0xfffffffc(%ebp)
  mov  0xfffffffc(%ebp),%eax    // put return value in %eax
  leave                         // recover caller ToS from callee %ebp
  ret                           // return to caller stack frame

// C source of bar():
//   int bar() { int sum; sum = foo(10, 20); return sum; }
<bar>:
  push %ebp                     // store caller %ebp on the stack
  mov  %esp,%ebp                // update callee %ebp
  sub  $0x18,%esp               // lower callee ToS by 0x18
  movl $0x14,0x4(%esp)          // "20" for foo(10, 20)
  movl $0xa,(%esp)              // "10" for foo(10, 20)
  call 2a <bar+0x16>            // call foo()
  mov  %eax,0xfffffffc(%ebp)    // "sum" referenced by 0xfffffffc(%ebp)
  mov  0xfffffffc(%ebp),%eax    // put return value in %eax
  leave                         // recover caller ToS from callee %ebp
  ret                           // return to caller stack frame

Listing 2: The method used to obtain the maximum "ebp" values

always @(
    current_dpl    or   // current stack segment's descriptor privilege level
    max_ebp_user   or   // maximum user-level ebp value of the filtered VA so far
    max_ebp_kernel or   // maximum kernel-level ebp value of the filtered VA so far
    current_ebp         // current ebp value of the filtered VA
)
begin
    /* if the descriptor privilege level is user level (ring 3) */
    if (current_dpl == 3)
        max_ebp_user <= MAX(max_ebp_user, current_ebp);
    /* if the descriptor privilege level is kernel level (ring 0) */
    else if (current_dpl == 0)
        max_ebp_kernel <= MAX(max_ebp_kernel, current_ebp);
end

D. The implementation of the proposed VIVT stack cache for multi-core systems

The proposed VIVT stack cache revolves around the notion of VA space filtering. This technique allows one of the running applications to receive preferential treatment through exclusive access to the stack cache. Any application can be chosen by the system administrator (of, say, a large datacenter, a computer cluster, or a large-scale server) to receive this special status. Naturally, the selected application is one that tends to dominate all other running applications at a particular point in time. Once a program is granted this status, its VA space will be granted exclusive access to each core's stack cache in order to boost its overall performance.

The filtering technique provides our proposed VIVT stack cache with four elemental advantages that constitute key enablers for the inclusion of such caches in multi-core environments: (a) Elimination of the synonym issue, since only one VA space uses the stack cache at any given time; (b) Elimination of the homonym issue, since the VA spaces of other running applications do not interfere with the active VA space; (c) As a consequence of (a) and (b) above, the VIVT stack cache can maintain operation (and coherence) with similarly low complexity as a physically tagged cache, by using only the virtual addresses of the selected VA space for both tags and indexes; (d) Stack cache thrashing is minimized as a result of the minuscule cache size (only the values of one VA space are stored at a time).

Figure 4 illustrates the proposed mechanism. More specifically, Figure 4a depicts the location of the VIVT stack cache. The non-stack L1 data caches are implemented as Physically Indexed, Physically Tagged (PIPT), whereas the accompanying stack caches are Virtually Indexed, Virtually Tagged (VIVT). It is important to note that the stack cache is small enough to guarantee a hit access latency of a single CPU cycle, while the L1 caches (Data/Instruction – for the non-stack region) require two to three CPU cycles for a cache hit (in line with most L1 caches found in commercially available CPUs today).

As previously mentioned, use of the stack cache is granted exclusively to only one selected VA space (process). This VA space filtering is achieved through the "cr3" register. The "cr3" register in the x86 architecture points to a VA space's page table and can therefore be used as a unique VA space identifier [15]. Hence, a system administrator can deliver a specific "cr3" register value to the VIVT stack cache, in order to boost the performance of (potentially) a dominating, heavy, multi-threaded workload. This can be achieved through the invocation of a special command. Note that this command is not a new ISA instruction, since the "cr3" register is already accessible to the programmer. Instead, operating system designers may create a new command that manages the newly proposed stack cache resources through "cr3" register value manipulations. This command would, essentially, work in a fashion similar to the Unix "nice" command, which modifies the CPU scheduler priorities of running processes in order to assign more CPU time to specific processes. Upon a change in the selected "cr3" value, all VIVT stack caches flush their dirty cache lines into the Level 2 data cache. This process prepares the stack caches to commence handling the newly selected VA space. The "cr3" VA space filtering mechanism was modeled in our evaluations using the Simics "Magic" instruction.
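The paper does not spell out how the selected workload announces itself; one plausible target-side sketch, assuming Simics's magic-instruction.h header and its MAGIC(n) no-op macro (both shipped with Simics, while the wrapper below and its name are hypothetical), is to have the dominating application execute a magic instruction once at startup. A simulator-side callback can then read the current "cr3" of the executing core and program the per-core stack caches with it.

    /* stackcache_claim.c - hypothetical target-side helper (illustrative sketch only).
     * Compile into the dominating multi-threaded workload and call once from main(). */
    #include "magic-instruction.h"    /* provides MAGIC(n); part of the Simics distribution */

    #define STACKCACHE_CLAIM_MAGIC 7  /* arbitrary magic-instruction number for this sketch */

    void stackcache_claim(void)
    {
        /* Executes the target-ISA "magic" NOP. A callback registered in the
         * simulator intercepts it, reads the executing core's cr3, and installs
         * that value as the filtered VA space in every core's VIVT stack cache. */
        MAGIC(STACKCACHE_CLAIM_MAGIC);
    }

On real hardware, the administrative command described above would play the same role, delivering the chosen "cr3" value to the stack caches through whatever privileged interface the operating system designers expose.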

During VA space filtering, data cache accesses are directed along two separate pathways (designated by the labels "1" and "2" in Figure 4a). Pathway "3" in Figure 4a denotes line fetching from the L2 cache, caused by L1 stack cache misses, which requires a TLB lookup. Pathway "4" in Figure 4a denotes dirty write-backs from the L1 stack caches to the L2 cache, which also require a TLB lookup. The stack-area memory references of the active VA space are directed to path "1" (i.e., to the stack cache), while all other memory references are directed to path "2" (i.e., the regular L1 cache). As explained in Section III-B, a coherence mechanism is still required for the stack caches, so as to handle self-coherence issues. Therefore, a "MESI" cache coherence protocol was employed for the stack caches, as well as for the regular L1 data caches.
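Putting Sections III-C and III-D together, the per-access routing decision can be summarized by the following C-style sketch. The structure and function names are ours (hypothetical), but the logic follows the text: an access goes to the VIVT stack cache only if the issuing context's cr3 matches the filtered value and the virtual address falls within [ToS, origin] of the appropriate (user or kernel) run-time stack; everything else takes the regular PIPT L1 path.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-core filtering state, mirroring Listing 2 and Figure 4b. */
    struct stack_filter {
        uint32_t filtered_cr3;    /* cr3 of the selected (dominating) VA space */
        uint32_t tos;             /* current Top-of-Stack (%esp)               */
        uint32_t max_ebp_user;    /* origin of the user-level run-time stack   */
        uint32_t max_ebp_kernel;  /* origin of the kernel-level run-time stack */
    };

    /* Route an access to path "1" (VIVT stack cache) or path "2" (regular PIPT L1). */
    static bool goes_to_stack_cache(const struct stack_filter *f,
                                    uint32_t cr3, uint32_t vaddr, int dpl)
    {
        if (cr3 != f->filtered_cr3)                   /* only the selected VA space */
            return false;                             /* is admitted (no homonyms)  */

        /* The x86 stack grows downwards, so the live region is [ToS, origin]. */
        uint32_t origin = (dpl == 3) ? f->max_ebp_user : f->max_ebp_kernel;
        return (vaddr >= f->tos) && (vaddr <= origin);
    }

    int main(void)
    {
        struct stack_filter f = { .filtered_cr3 = 0x1000, .tos = 0xbfff0000u,
                                  .max_ebp_user = 0xbfffff00u, .max_ebp_kernel = 0xc0400000u };
        printf("%d %d\n",
               goes_to_stack_cache(&f, 0x1000, 0xbfff0040u, 3),   /* stack access of the filtered VA -> 1 */
               goes_to_stack_cache(&f, 0x2000, 0xbfff0040u, 3));  /* access by another process       -> 0 */
        return 0;
    }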

(a) A high-level overview of the proposed architecture   (b) The VA stack area range

Figure 4: Implementation of the proposed VIVT stack cache for multi-core systems

IV. EXPERIMENTAL EVALUATION

A. Simulation framework

In order to comprehensively evaluate the operational efficacy and efficiency of the proposed virtually tagged cache, we employ Simics, a full-system simulator developed by Wind River [13]. Simics provides a "Magic" instruction that enables the simulated software on the target platform to deliver events to the simulator itself. The "Magic" instruction is incorporated as a special No Operation (NOP) instruction within Simics's target ISA and, hence, can be directly implemented in real computer systems. We use the "Magic" instruction to correctly acquire each multi-threaded benchmark's "cr3" register value, which is essential to the VA space filtering mechanism. The benchmark applications used in our simulations come from the PARSEC benchmark suite [16]. PARSEC is an increasingly popular benchmark suite containing parallel workloads from various emerging applications that are considered representative of next-generation shared-memory programs for CMPs. Furthermore, all individual benchmarks can be executed in a multi-threaded mode and are, therefore, suitable for our proposed stack cache architecture for multi-core processors. The benchmark applications are built using the "parsecmgmt -a build -p all -c gcc-hooks" command option. In addition to enabling multi-threading, this option also helps utilize Simics's "Magic Instruction Breakpoint". The Magic Breakpoint capability allows the simulator to accurately measure the performance of the PARSEC benchmarks within a so-called Region of Interest (RoI), which is the part of the program's execution time that is deemed to be performance-meaningful.

The simulated target is the Simics "Tango machine" in a multi-core setting. This machine models an Intel Pentium 4 processor (x86-440bx machine). We simulate a system with 8 processing cores running the Fedora Core 5 Operating System (with Linux kernel 2.6.15, including SMP support).

B. Designs under evaluation

We compare two designs using the above-mentioned full-system simulation framework; namely, a "Baseline" design and the "Proposed" architecture. The "Baseline" system serves as the reference point and employs a 32-KByte L1 data cache per core, which is twice as large as the one in the "Proposed" setup (i.e., 16 KBytes per core). Of course, the "Proposed" design has an additional 2-KByte L1 virtually tagged stack cache, for a total of 18 (16 + 2) KBytes of L1 data cache per core. Despite this unfair comparison that tilts strongly in favor of the baseline design, it will be demonstrated in the following sub-sections that the proposed design comfortably outperforms the generic architecture, despite the significantly smaller L1 data cache. Both the 32- and 16-KByte L1 data caches (of the baseline and proposed architectures, respectively) have a 3-cycle hit latency. The 2-KByte stack cache has a single-cycle hit latency, similar to the one proposed in [4]. Table I summarizes the two design configurations.

Table I: The simulation configuration of the two designs under evaluation

Baseline (32 KB regular L1D + 0 KB stack L1D):
  L1 (data)            32 KB; 64-B line; 2-way      3-cycle hit; write back/allocate
  L1 (instruction)     32 KB; 64-B line; 2-way      3-cycle hit
  L2 (unified/shared)  512 KB; 128-B line; 8-way    12-cycle hit; write back/allocate
  Main memory          2 GB                         218-cycle stall
  CPU cores            8                            multi-threaded workloads, 16 sub-threads each
  Coherence protocol   -                            MESI

Proposed (16 KB regular L1D + 2 KB stack L1D):
  L1 (data)            16 KB; 64-B line; 2-way      3-cycle hit; write back/allocate
  L1 (stack)           2 KB; 64-B line; 2-way       1-cycle hit; write back/allocate
  L1 (instruction)     32 KB; 64-B line; 2-way      3-cycle hit
  L2 (unified/shared)  512 KB; 128-B line; 8-way    12-cycle hit; write back/allocate
  Main memory          2 GB                         218-cycle stall
  CPU cores            8                            multi-threaded workloads, 16 sub-threads each
  Coherence protocol   -                            MESI

C. Overall performance evaluation

Figures 5a and 5b clearly corroborate our prior assertions that the proposed virtually tagged cache instills dramatic performance improvements over the much larger L1 data cache of the baseline design. The average overall performance improvement was found to be 19% for the 10 benchmark applications tested. Of course, the performance boost is attributed to the presence of the "small-but-fast" virtually tagged stack cache. Figure 6a depicts the breakdown statistics of the memory access patterns. One can notice the large percentage of accesses served by the stack cache. In fact, around 50% (on average) of all memory accesses can be directed to the small stack cache. It is our belief that such a high percentage of accesses easily justifies the incorporation of our proposed virtually tagged stack cache in future multi-core systems.

(a) Overall latency normalized to the baseline design   (b) Overall performance improvement over the baseline design

Figure 5: Performance improvements achieved using the proposed virtually tagged stack cache aimed at multi-core microprocessors

D. TLB access behavior

Figure 6a can also help us comprehend the TLB access behavior of the two designs, which is presented in Figure 6b. Since the stack cache read/write hit ratios are very high, the TLB access reduction percentages shown in Figure 6b are almost identical to the sum of the "Stack L1 Data Write" and "Stack L1 Data Read" portions in Figure 6a. It is important to note that the TLB access reductions in Figure 6b are a direct consequence of the virtually tagged stack cache hits, which obviate the need for TLB accesses. In fact, a TLB access is only needed in the infrequent cases of a stack cache read/write miss and a stack cache dirty write-back (as illustrated by pathways "3" and "4", respectively, in Figure 4a). On average, the proposed design reduces the number of TLB accesses by 45.5%. This substantial reduction in TLB references will, in turn, translate into a reduction in energy consumption. As previously described in Section III-A, TLB accesses are energy-expensive, due to the high associativities that typically characterize TLB implementations.

(a) Frequency breakdown of the memory area accesses

(b) Reduction in TLB accesses in the proposed design (16 KB + 2 KB stack cache) over the baseline design (32 KB and no stack cache)

Figure 6: Memory access pattern analysis

E. L1/L2 statistics analysis

As expected from the smaller L1 data cache (18 KB vs. 32 KB) of the proposed architecture, the L1 data cache hit ratio is noticeably lower, as compared to the baseline design. Figure 7 illustrates the L1 data read/write hit ratios of the two evaluated configurations. In the case of the proposed design, the L1 ratios refer to the combined hit ratios of the regular L1 data cache and the stack cache. Naturally, the lower L1 data hit ratios in the proposed architecture result in increased L2 cache accesses.

(a) L1 data read hit ratios   (b) L1 data write hit ratios

Figure 7: L1 data cache read/write statistics of the two evaluated architectures

Figure 8 shows the effect of the presence of the stack cache on the L2 cache behavior. Despite the elevated number of L2 cache accesses in the proposed design (caused by the lower L1 hit ratios), the existence of the stack cache beneficially alters the operational behavior of the L2 cache. As illustrated in Figure 8, all benchmark applications running on the new design experience increased read and write hit ratios in the L2 cache. This is because the stack cache's operation precipitates a strong partitioning effect against data-access thrashing between the L1 and L2 caches. This partitioning effect enables, for example, the "canneal" benchmark to exhibit an overall performance improvement of over 15%, even though it suffers from a relatively low L1 read hit ratio, as shown in Figure 7. This result is very important, since it showcases the multi-level benefits of the proposed stack cache. Its presence not only lowers the average L1 data cache hit latency, it also improves the performance of the L2 cache. Hence, the combined effect on overall performance far outweighs the reduction in L1 hit ratios (resulting from the smaller overall regular L1 data cache size).

(a) L2 read hit ratios   (b) L2 write hit ratios

Figure 8: L2 cache read/write statistics of the two evaluated architectures

V. CONCLUSION

Stack-area memory accesses have a profound effect on overall system performance, since application workloads tend to access the stack area at a very high frequency. This attribute lends itself nicely to the incorporation of a level-one helper cache dedicated to stack accesses. Besides the obvious benefits emanating from such a "small-but-fast" stack cache, further gains may be reaped if this structure is virtually (rather than physically) tagged. While virtually tagged caches avoid the performance and energy costs of address translation, they inherently suffer from synonym/homonym issues, which incur considerable overhead to resolve. Moreover, the resolution of these pathological artifacts is even more complicated and costly in multi-core environments. Given the current, ubiquitous nature of multi-core microprocessors and their front-runner role in exploiting thread-level parallelism in the future, it becomes imperative to devise a methodology that efficiently integrates virtually tagged stack caches in CMPs.

Toward this end, we propose the first – to the best of our knowledge – stack cache architecture geared toward multi-core processors. The presented modified stack cache employs a manual filtering technique that selects the run-time stack areas of only one VA space (process) for caching into the stack cache. The filtering process greatly alleviates the afflictions of virtually tagged caches, while, at the same time, it enables the extraction of all the associated performance and energy advantages. The proposed stack cache mechanism is especially amenable to contemporary server systems that tend to run one primary, heavy application per server machine for increased reliability.

Comprehensive evaluation using a full-system simulation environment conclusively proves the efficacy and efficiency of the proposed virtually tagged stack cache for multi-core settings. Average performance improvements of around 20% validate the potential of such a design (even while using a smaller aggregate L1 data cache). Additionally, a 45.5% reduction in TLB accesses is observed, lending credence to our claim of reduced energy consumption. Overall, our results prove the viability of such helper stack caches in future CMPs and demonstrate their powerful capabilities.

ACKNOWLEDGMENT

This work is partially supported by KORUSTECH(KT)-2008-DC-AP-FS0-0003. It also falls under the Cyprus Research Promotion Foundation's Framework Programme for Research, Technological Development and Innovation 2009-10 (DESMI 2009-10), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant TΠE/ΠΛHPO/0609(BIE)/09.

REFERENCES

[1] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: near-optimal block placement and replication in distributed caches," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). ACM, 2009, pp. 184–195.

[2] Intel Sandy Bridge Architecture. [Online]. Available: http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=7

[3] S. Cho, P.-C. Yew, and G. Lee, "Decoupling local variable accesses in a wide-issue superscalar processor," in Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99). IEEE Computer Society, 1999, pp. 100–110.

[4] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas, "L1 data cache decomposition for energy efficiency," in Proceedings of the 2001 International Symposium on Low Power Electronics and Design (ISLPED '01). ACM, 2001, pp. 10–15.

[5] H.-H. S. Lee, M. Smelyanskiy, G. S. Tyson, and C. J. Newburn, "Stack value file: Custom microarchitecture for the stack," in Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01). IEEE Computer Society, 2001, pp. 5–.

[6] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: a processor architecture for multiprocessing," in Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA '90). ACM, 1990, pp. 104–114.

[7] S. Kumar, C. J. Hughes, and A. Nguyen, "Carbon: architectural support for fine-grained parallelism on chip multiprocessors," in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07). ACM, 2007, pp. 162–173.

[8] M. Cekleov and M. Dubois, "Virtual-address caches, part 1: Problems and solutions in uniprocessors," IEEE Micro, vol. 17, pp. 64–71, September 1997.

[9] ——, "Virtual-address caches, part 2: Multiprocessor issues," IEEE Micro, vol. 17, pp. 69–74, November 1997.

[10] D. H. Woo, M. Ghosh, E. Ozer, S. Biles, and H.-H. S. Lee, "Reducing energy of virtual cache synonym lookup using Bloom filters," in Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '06). ACM, 2006, pp. 179–189.

[11] D. R. Butenhof, Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., 1997.

[12] OpenMP. [Online]. Available: http://www.openmp.org/

[13] Wind River Systems. [Online]. Available: http://www.windriver.com

[14] M. Mamidipaka and N. Dutt, "On-chip stack based memory organization for low power embedded architectures," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '03), Volume 1. IEEE Computer Society, 2003, pp. 11082–.

[15] G. Venkatasubramanian, R. J. Figueiredo, R. Illikkal, and D. Newell, "A simulation analysis of shared TLBs with tag based partitioning in multicore virtualized environments," in Workshop on Managed Many-Core Systems (MMCS '09), 2009.

[16] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, 2008, pp. 72–81.
