

Optimising the Garbage Collector using the Cache

Brenda Wang <u6374399>

A report submitted for the course
Individual Research Project (COMP3770)

The Australian National University

October 2019

DRAFT – 25 October 2019


© Brenda Wang <u6374399> 2019


Except where otherwise indicated, this report is my own original work.

Brenda Wang <u6374399>
25 October 2019


Contents

1 Introduction
    1.1 Abstract
    1.2 Introduction
    1.3 Outline

2 Background
    2.1 Cache
        2.1.1 Overview
        2.1.2 Replacement Policy
        2.1.3 Cache Misses
        2.1.4 Multi-level Caches
        2.1.5 Cache Structure
        2.1.6 Cache Colours
    2.2 Garbage Collectors
        2.2.1 Overview
        2.2.2 Mutator
        2.2.3 Collector
        2.2.4 Algorithms

3 Cache Pollution
    3.1 Conflict Misses
    3.2 Problem
    3.3 Summary

4 Cache Allocation Technology
    4.1 Implementation
    4.2 The Mutator and The Collector
    4.3 Summary

5 Cache Colouring
    5.1 Implementation
    5.2 Young and Mature Objects
        5.2.1 Semispace
        5.2.2 Generational Copy
    5.3 Summary

6 Experimental Methodology
    6.1 Software platform
    6.2 Hardware platform

7 Results
    7.1 Cache Allocation Technology
    7.2 Cache Colouring
        7.2.1 Semispace
        7.2.2 Generational Copy
    7.3 Summary

8 Conclusion
    8.1 Future Work
        8.1.1 Cache Allocation Technology
        8.1.2 Cache Colouring

Appendix
    8.1.3 Project Structure
    8.1.4 Testing
    8.1.5 Project Summaries

Bibliography

List of Figures

2.1 The memory hierarchy. As size increases, speed of access decreases.
2.2 Diagram of the memory hierarchy, with the cache split into three levels. As the levels decrease, the size and hence the latency decrease. However, the smaller size causes a greater number of cache misses.
2.3 An example of a set-associative cache. There are 8 sets and 4 cache lines per set. This cache is then mapped to memory, with blocks corresponding to certain sets.
2.4 An example of the Semispace Collector algorithm.
2.5 The identification of live and dead objects. All live objects are reachable from the roots. Dead objects are unreachable.
2.6 An example of the Generational Copy Collector algorithm.
3.1 An example of a cache miss using a program with two instructions. The memory location 0x00 is evicted from the cache to make room for 0x10, only to be fetched again later.
4.1 Diagram of COS and CBMs for a cache with 12 ways in total.
5.1 An example of cache colouring. Each page in physical memory is mapped to a proportion of the cache of the same colour. The cache is composed of 4 colours, with 8 ways in total.
5.2 Comparison of normal pages and huge pages.
8.1 A copy of the Independent Study Contract.
8.2 A copy of the Independent Study Contract.

List of Tables

6.1 Processors used in our evaluation.
7.1 Result of using Cache Allocation Technology. For details of the SemiSpace garbage collector, see Section 2.2.4.
7.2 Comparison of cache colouring using SemiSpace. For details of the SemiSpace garbage collector, see Section 2.2.4.
7.3 Comparison of cache colouring using GenCopy (Generational Copy). For details of the GenCopy garbage collector, see Section 2.2.4.

Chapter 1

Introduction

1.1 Abstract

The performance of the cache has a profound impact on the performance of the overall program. Due to the poor temporal locality of garbage collectors, cache misses deal a major blow to the performance of the cache. This report presents methods to reduce the rate of cache misses and hence improve the performance of the cache and the overall system, using two techniques: cache colouring and cache allocation technology. We demonstrate that these techniques do mitigate cache misses. However, we also find that cache allocation technology does not improve the performance of the system, as its overhead outweighs its small benefits. Cache colouring, on the other hand, has low enough overhead and large enough benefits that we see an improvement in the overall performance of the system on average.

1.2 Introduction

Cache performance plays a substantial role in the performance of the overall system, as memory latency is a clear bottleneck of program performance [19, 2]. Cache performance is heavily linked to the rate of cache misses. Mitigating cache misses reduces memory latency and hence improves the performance of the overall system.

Since the CPU cache is located much closer to the processor than random access memory (RAM) and disk, it provides much faster access to data. Although it trades off in size, being only a fraction of the size of RAM, it is incredibly important due to the reduction in data access times. As long as data is located in the cache, the processor can fetch it directly from the cache rather than from RAM, reducing latency.

On the other hand, if the data is not in the cache, the processor needs to fetch the new entry from memory and evict an existing entry from the cache. This failure to read or write data immediately results in a cache miss and causes much longer latency. There are multiple reasons for cache misses [7], but one of the most important is conflict misses. Conflict misses are misses that could have been avoided if the cache


had not evicted the entry earlier.

Conflict misses can occur as a result of the poor temporal locality present in

garbage collectors. Garbage collectors often perform tasks such as copying and transitive closure, which touch every live object on the heap about once [3]. As such, collection often evicts large amounts of useful data from the cache, replacing it with unneeded data and causing what is known as cache pollution.

Conflict misses can also occur as a result of the competition between young and mature objects. Young, short-lived objects tend to pollute the cache [2], potentially pushing out important and heavily reused mature objects. As before, this pollution of the cache results in the degradation of overall performance.

There are multiple existing solutions for reducing memory latency, including bypassing the cache [9, 17] and pinning particular portions of the cache [14]. In this report, we use cache partitioning techniques [18, 8], namely page colouring and Intel's Cache Allocation Technology (CAT), to help mitigate the problems above. Additionally, we evaluate the efficacy of these techniques with regard to their benefits as well as their overhead.

To summarise, my contributions are:

• Investigated potential optimisations of the cache for garbage collectors

• Implemented page colouring and Cache Allocation Technology in JikesRVM

• Evaluated the impact of page colouring and Cache Allocation Technology

– Cache Allocation Technology does not seem very promising, as it can currently only operate on the L3 cache and its overheads outweigh the minimal performance improvements

– Page colouring has been shown to bring performance improvements (around 0.5% on average, with up to 5% for particular benchmarks)

1.3 Outline

Chapter 2 provides the necessary background on caches and garbage collectors. Chapter 3 describes the problem of cache pollution. Chapters 4 and 5 present the two techniques we apply, Cache Allocation Technology and cache colouring. Chapter 6 describes the experimental methodology, Chapter 7 presents the results, and Chapter 8 concludes and outlines future work.


Chapter 2

Background

This chapter gives an overview of the background necessary for understanding the problem. Section 2.1 gives an overview of the cache, its operation and its structure. Section 2.2 gives an overview of garbage collectors and a description of the relevant algorithms.

2.1 Cache

2.1.1 Overview

In contrast to random access memory (RAM) or disk, the cache is located on or much closer to the processor. Due to its closer proximity, it provides much faster access to data. The close physical proximity ensures that access speeds and energy costs are minimal, but it also prevents the cache from being too large. Thus, the cache trades off size, being orders of magnitude smaller than RAM. We can see the relationship between size and speed in Fig. 2.1.

The cache stores copies of data from main memory. Each entry in the cache corresponds to a cache line or cache block, ranging anywhere from 8 to 512 bytes. In most cases, the cache line size is 64 bytes.

When accessing cached data, the latency is much shorter than that of main memory. When the data we try to access is in the cache, the result is a cache hit. On the other hand, when data is not in the cache, we have to fetch the entry from the

Figure 2.1: The memory hierarchy. As size increases, speed of access decreases.

main memory and evict an existing entry. This is known as a cache miss and results in a main memory access, causing longer latency.

The proportion of accesses that result in a cache hit is known as the hit rate. Likewise, the proportion of accesses that result in a cache miss is known as the miss rate. These proportions are a measure of the effectiveness of the cache: we want to reduce the number and rate of cache misses to improve the performance of the overall system.

As processor performance continues to grow, memory latency continues to be a major limiter on the overall performance of the system. Thus, due to its lower latency, the cache acts as an important bridge between the CPU and memory, reducing the memory latency bottleneck. However, both the size and location of the cache are a delicate balance: increasing the cache size increases access latency but reduces the miss rate, and vice versa.
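
This latency/miss-rate trade-off can be quantified with the standard average memory access time (AMAT) formula: hit time plus miss rate times miss penalty. The sketch below uses illustrative numbers, not measurements from our hardware.

```python
# Average memory access time (AMAT): a standard way to quantify the
# latency / miss-rate trade-off described above. Numbers are illustrative.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# A small, fast cache: short hit time but more misses.
small = amat(hit_time=1, miss_rate=0.10, miss_penalty=100)  # about 11 cycles
# A larger, slower cache: longer hit time but fewer misses.
large = amat(hit_time=4, miss_rate=0.02, miss_penalty=100)  # about 6 cycles
print(small, large)
```

With these illustrative numbers the larger cache wins despite its longer hit time, showing why the balance is delicate rather than one-sided.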

2.1.2 Replacement Policy

When data has to be evicted from the cache to make space for a new entry, we need some form of heuristic to decide which entry to evict. This heuristic is known as the replacement policy. It should predict which existing cache entry is least likely to be used in the future.

The most popular replacement policy is Least Recently Used (LRU), which replaces the entry that was used least recently. Other replacement policies include First In First Out (FIFO), where entries are evicted in the order they arrived, and Last In First Out (LIFO), where the most recently added entry is evicted.
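
As a sketch of how LRU behaves, the toy simulation below replays a short access sequence against a two-entry cache. It is illustrative only; hardware caches implement this logic in silicon, often with approximations of true LRU.

```python
from collections import OrderedDict

# A minimal LRU cache simulation (illustrative, not a hardware model).
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry first

    def access(self, block):
        """Return True on a hit, False on a miss."""
        if block in self.entries:
            self.entries.move_to_end(block)   # mark most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[block] = True
        return False

cache = LRUCache(capacity=2)
hits = [cache.access(b) for b in ["A", "B", "A", "C", "B"]]
# A miss, B miss, A hit, C miss (evicts B), B miss (evicts A)
print(hits)  # [False, False, True, False, False]
```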

2.1.3 Cache Misses

There are many different reasons for cache misses [7], and we can split them into three types.

Compulsory misses, also known as cold misses, occur when accessing a block that has never been accessed before. These are cache misses that would occur even in an infinitely large cache. They are known as compulsory because they are unavoidable.

Capacity misses occur as a result of the cache being too small: the data requested by the program exceeds what the cache can hold. For example, iterating repeatedly over a 5 MB array when the cache is only 2 MB in size.

Conflict misses occur when the program originally had a piece of data in the cache, but it was evicted by another piece of data.

There are multiple methods of reducing these cache misses, such as prefetching [13], bypassing the cache [9, 17] or cache partitioning [8, 18, 14].

2.1.4 Multi-level Caches

As mentioned before, caches require a balance between latency and hit rate. Large caches have better hit rates at the cost of longer latency, and smaller caches tend to have worse hit rates but much lower latency. To have the benefits of both

Figure 2.2: Diagram of the memory hierarchy, with the cache split into three levels. As the levels decrease, the size and hence the latency decrease. However, the cost is that the smaller size causes a greater number of cache misses.

small and large caches, many computers use multiple levels of cache: smaller but faster caches are backed by larger but slower caches.

Multi-level caches can have varying numbers of levels, typically two or three. The Level 1 (L1) cache is the smallest cache, the Level 2 (L2) cache is the next layer, and so on. The last-level cache (LLC) refers to the largest cache, typically the L2 or L3 cache.

Multi-level caches operate by checking each layer in turn: if the first layer misses, the next layer of cache is checked, and so on, before main memory is accessed.

2.1.5 Cache Structure

There are many different types of cache structure, such as direct-mapped, fully associative and set-associative caches. The most common is the set-associative cache, which is a blend of the direct-mapped cache and the fully associative cache.

One can think of a set-associative cache as an n × m matrix, where the cache is divided into n sets of m cache lines, also known as ways. Each memory block or entry is mapped onto a set, then placed into one of that set's cache lines. We can see the structure of the cache in Fig. 2.3.

A direct-mapped cache is one-way set-associative; that is, it is composed of n sets of 1 cache line each. On the other hand, a fully associative cache has 1 set of m lines.

Whereas the replacement policy of Section 2.1.2 decides which entry to evict, the cache's associativity dictates where in the cache a copy of an entry may go. In a fully associative cache, an entry or block is free to go anywhere in the cache. At the other extreme, in a direct-mapped cache, an entry can only go in one spot. Most caches are a combination of the two and are known as n-way set-associative, where each memory block can be cached in any of the n locations of its set.
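
The mapping from an address to its set can be sketched as follows. The geometry (64-byte lines, 8 sets, 4 ways) matches the example cache of Fig. 2.3; the helper name is ours.

```python
# How an address maps to a set in a set-associative cache.
# Geometry matches the example of Fig. 2.3: 64-byte lines, 8 sets, 4 ways.

LINE_SIZE = 64   # bytes per cache line
NUM_SETS  = 8
NUM_WAYS  = 4    # the block may occupy any of the set's 4 ways

def set_index(address):
    """The set a memory address maps to."""
    block_number = address // LINE_SIZE
    return block_number % NUM_SETS

# Addresses one block apart map to consecutive sets...
print(set_index(0x000), set_index(0x040))                  # 0 1
# ...while addresses NUM_SETS blocks apart collide in the same set.
print(set_index(0x000), set_index(NUM_SETS * LINE_SIZE))   # 0 0
```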

2.1.6 Cache Colours

Cache colouring or page colouring assigns a particular colour to each memory page.Colours are assigned to physical addresses in sequence. Pages with different colours

Figure 2.3: An example of a set-associative cache. There are 8 sets and 4 cache lines per set. This cache is then mapped to memory, with blocks corresponding to certain sets.

have different positions in the CPU cache. Typically, when allocating sequential pages of virtual memory, the kernel will collect pages of different colours and map them to virtual memory. This ensures that sequential virtual memory pages do not contend for the same cache lines.
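
As a rough sketch of the arithmetic, the colour of a physical page is its page frame number modulo the number of colours, where the number of colours is the bytes of cache covered by one way divided by the page size. The cache parameters below are illustrative, not those of the evaluation machines.

```python
# Assigning cache colours to physical pages (a sketch; cache parameters
# are illustrative, not taken from the evaluation hardware).

PAGE_SIZE  = 4096        # bytes
CACHE_SIZE = 512 * 1024  # bytes
NUM_WAYS   = 8
LINE_SIZE  = 64

NUM_SETS    = CACHE_SIZE // (NUM_WAYS * LINE_SIZE)   # 1024 sets
# Consecutive physical pages cycle through this many colours before two
# pages start contending for the same cache sets.
NUM_COLOURS = (NUM_SETS * LINE_SIZE) // PAGE_SIZE    # 16 colours

def page_colour(physical_address):
    page_frame_number = physical_address // PAGE_SIZE
    return page_frame_number % NUM_COLOURS

print(NUM_COLOURS)                               # 16
print(page_colour(0x0000), page_colour(0x1000))  # 0 1
print(page_colour(16 * PAGE_SIZE))               # 0 -- wraps back to colour 0
```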

2.2 Garbage Collectors

2.2.1 Overview

Garbage collection, also known as automatic memory management, is the process of recycling memory that is known to never be used again. It avoids the need for the programmer to explicitly allocate and deallocate memory, avoiding a whole host of problems including memory leaks, premature frees and double frees. Additionally, it can simplify the programming process, as the management of objects between modules becomes unnecessary. As a result, it increases the productivity of programmers.

Garbage collectors [5, 10] can be split into two main components: the mutator and the collector. There may be separate threads for the collector and the mutator, or they may be interleaved with each other. Additionally, the collector and the mutator can run concurrently, with a concurrent garbage collector, or one after the other, with a stop-the-world garbage collector.

2.2.2 Mutator

The mutator is the part that executes the user or application code. It performs operations such as the allocation, reading and writing of objects. Though the user code


issues allocation requests, the allocator is typically considered part of the collector itself.

Objects have different temporal characteristics [2, 1]. There are two typical groups of objects: short-lived and long-lived. Long-lived objects, also known as mature objects, tend to be accessed frequently. On the other hand, short-lived objects, also known as young objects, can be accessed either frequently, such as a value in a loop, or infrequently. However, in typical systems a significantly greater proportion of objects are young, with only a small proportion being long-lived.

2.2.3 Collector

The collector, on the other hand, is the part of the system that discovers unreachable memory and reclaims it. Though there are many different types of garbage collector, we can split them into two styles: direct reference counting and indirect tracing collection. Direct reference counting determines the liveness of an object from the object alone, without needing to trace the heap. In contrast, indirect tracing collectors determine which objects are dead by tracing which objects are reachable through a chain of references from a root. Most garbage collectors are tracing garbage collectors.

Tracing garbage collectors, as the name implies, require some form of tracing of the heap from the roots. This means going through the heap and tracing the object graph. This operation typically touches many objects but rarely reuses them, exhibiting poor temporal locality.

Additionally, tracing garbage collectors only trace through live objects. Hence, the fewer live objects there are, the more efficient the tracing garbage collector is.

2.2.4 Algorithms

In this section, we will discuss the garbage collection algorithms that are mentioned in this report.

Semispace

The semispace [4, 6] copying algorithm is a type of tracing garbage collector that involves copying. Copying compacts the heap, reducing fragmentation, and unlike other algorithms such as mark-compact, it only requires one pass over the heap. However, it also halves the size of the available heap, reducing usable memory.

As seen in Fig. 2.4, the heap is divided into two equal-sized semispaces. We can dub the first space, space 0, the 'tospace' and space 1 the 'fromspace'. All new objects are allocated into the 'tospace' during an allocation cycle. When a new object can no longer be allocated into the 'tospace' because there is no space left, the two spaces flip, and the 'tospace' is now dubbed the 'fromspace' and vice versa. Hence, space 0 is now the 'fromspace' and space 1 is now the 'tospace'. During a collection cycle, we identify all live objects, copying them from the 'fromspace' into the 'tospace'. This essentially compacts the live objects together, reducing fragmentation. The dead objects in the 'fromspace' are later zeroed for safety purposes.
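
The copying scheme described above can be sketched as follows. This is a toy model of the algorithm, not JikesRVM's implementation: real collectors use bump allocation and a Cheney-style scan over raw memory rather than Python objects and recursion.

```python
# A toy semispace collection. Objects are dicts holding a list of
# children (outgoing references) and a forwarding pointer.

def new_object(children=()):
    return {"children": list(children), "forward": None}

def collect(roots, fromspace):
    """Flip the spaces and copy every object reachable from the roots."""
    tospace = []

    def copy(obj):
        if obj["forward"] is None:       # not yet copied into tospace
            clone = new_object()
            obj["forward"] = clone       # forwarding pointer handles sharing
            tospace.append(clone)
            clone["children"] = [copy(c) for c in obj["children"]]
        return obj["forward"]

    new_roots = [copy(r) for r in roots]
    fromspace.clear()                    # dead objects are simply discarded
    return new_roots, tospace

# Build a heap: a -> b, while c is unreachable garbage.
b = new_object()
a = new_object([b])
c = new_object()
fromspace = [a, b, c]
roots, tospace = collect([a], fromspace)
print(len(tospace))  # 2 -- only a and b survive; c was never copied
```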

Figure 2.4: An example of the Semispace Collector algorithm.

When an object persists through one garbage collection cycle, it is promoted from young to mature. As seen in Fig. 2.4, the blue objects in space 0 are promoted to mature objects once copied. Therefore, all objects that are copied are considered mature, and the rest are considered young.

The identification of objects as live or dead is done by tracing through all the objects in the heap. If an object is reachable from the roots, it is deemed live. Otherwise, if it has not been reached, it is deemed dead, as nothing can access it anymore. Only live objects are copied out; dead objects persist in the fromspace until they are later zeroed. We can see the identification process illustrated in Fig. 2.5.
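
This identification step amounts to a graph traversal from the roots. The sketch below uses the object names of Fig. 2.5; the exact edge structure is illustrative, as the figure does not specify every reference.

```python
# Liveness as reachability: trace the object graph from the roots.
# Object names follow Fig. 2.5; the edges are an illustrative guess.

heap = {
    "A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": [],
    "F": [],   # F is referenced by nothing reachable
}
roots = ["A"]

def live_objects(roots, heap):
    live, worklist = set(), list(roots)
    while worklist:
        obj = worklist.pop()
        if obj not in live:
            live.add(obj)
            worklist.extend(heap[obj])  # follow outgoing references
    return live

live = live_objects(roots, heap)
dead = set(heap) - live
print(sorted(live))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(dead))  # ['F']
```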

Figure 2.5: The identification of live and dead objects. All live objects are reachable from the roots. Dead objects are unreachable.


Figure 2.6: An example of the Generational Copy Collector algorithm.

Generational Copy

As mentioned before, tracing collectors are most efficient when there are few live objects in the space being collected. Long-lived or immortal objects are handled poorly, as the collector will repeatedly trace and copy them from one semispace to another.

Generational collectors [12, 15] extend previous collectors by separating mature or old objects, where possible, from young objects. Since young objects tend to die young, concentrating reclamation on them maximises the recovered space whilst minimising the copying and tracing time. Objects are segregated by age into spaces, with the young generation collected preferentially in comparison to the mature space.

Generational copy is an extension of semispace which segregates objects into two age-based spaces: the nursery space and the mature space. The mature space consists of objects that have survived one or more garbage collection cycles, whereas the nursery space is for objects that have been recently allocated. The mature space is further subdivided into two mature semispaces, which behave like the semispace algorithm. We can see this in Fig. 2.6.

Generational copy typically only needs to collect the nursery space, copying objects into the mature space if they are live and discarding the rest. However, there are cases where it must collect the whole heap, such as when the mature space is too full for the nursery space to copy objects into.
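
A minor (nursery-only) collection can be sketched as follows. This is a simplification: real generational collectors copy survivors with forwarding pointers and track mature-to-nursery references in a remembered set, which we model here as an explicit list.

```python
# A sketch of a minor collection: live nursery objects are promoted into
# the mature space; the rest are discarded. Illustrative only.

def minor_collect(nursery, mature, roots, remembered):
    """roots/remembered: nursery objects reachable from outside the nursery."""
    live = set(roots) | set(remembered)
    survivors = [obj for obj in nursery if obj in live]
    mature.extend(survivors)   # promotion: copy survivors into mature space
    nursery.clear()            # everything else in the nursery is reclaimed
    return survivors

nursery = ["x", "y", "z"]
mature  = ["m1"]
promoted = minor_collect(nursery, mature, roots=["x"], remembered=["z"])
print(promoted, mature, nursery)  # ['x', 'z'] ['m1', 'x', 'z'] []
```

Note that "m1" is never touched: this is precisely how generational copy avoids repeatedly processing mature objects during a minor collection.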

Since generational copy typically only needs to reclaim objects from the younger generation, we can avoid the repeated processing of mature objects. However, the downside is that garbage in the mature space cannot be reclaimed when collecting the younger generation, so we may retain objects that are actually dead because


their parent in the mature space is dead but has not yet been reclaimed. Thus, collection of old objects is not prompt.


Chapter 3

Cache Pollution

In this chapter we discuss the specific problem of cache pollution and how it relates to cache misses. For a summary of this chapter, refer to Section 3.3.

3.1 Conflict Misses

As mentioned in Section 2.1.3, there are multiple types of cache miss [7, 11]. In this report, we focus on conflict misses: cache misses that could have been avoided if the cache had not evicted the entry earlier. Though some of these misses are inevitable, others can be mitigated.

An example is shown in Fig. 3.1. Despite the memory location 0x00 being needed by the second instruction, it is evicted to make room for 0x10. Thus, when the processor attempts to access 0x00 in the second instruction, the result is a conflict miss: had the cache not replaced 0x00 with 0x10, the miss would have been avoided.
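
The eviction sequence of Fig. 3.1 can be reproduced with a toy direct-mapped cache in which 0x00 and 0x10 happen to map to the same line; the cache geometry here is illustrative.

```python
# Reproducing the conflict miss of Fig. 3.1 with a tiny direct-mapped
# cache in which addresses 0x00 and 0x10 map to the same line.

LINE_SIZE = 4
NUM_LINES = 4   # addresses 16 bytes apart collide on the same line

cache = {}      # line index -> address currently cached there

def access(address):
    """Return True on a hit; on a miss, evict whatever occupies the line."""
    line = (address // LINE_SIZE) % NUM_LINES
    hit = cache.get(line) == address
    cache[line] = address
    return hit

access(0x00)         # miss: 0x00 loaded into line 0
access(0x10)         # miss: 0x10 evicts 0x00 (same line)
print(access(0x00))  # False -- a conflict miss: 0x00 was evicted earlier
```

In a two-way set-associative cache of the same size, both addresses could coexist in one set and the third access would hit.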

3.2 Problem

Conflict misses can arise in many different scenarios. One such scenario is cache pollution, which occurs when a program loads data of less importance into the cache, causing other, more useful data to be evicted. As mentioned before, there are multiple existing methods of mitigating pollution of the cache, including bypassing the cache [9, 17] and cache partitioning [8, 18, 14].

Cache pollution results in data being evicted into lower levels of the memory hierarchy. When we attempt to access that data at a later stage, we need to go to these lower levels of cache or memory, which degrades performance. Each cache miss can cost many hardware cycles, with the penalty growing at every level of the hierarchy that must be accessed.

This problem is becoming more and more prominent as the 'memory wall' [16] grows increasingly apparent with faster and faster CPUs. As CPU speeds increase, the disparity between CPU cycles and memory latency grows. Thus, multiple different techniques are used to overcome this memory latency. On the hardware side, we can control the size and location of the cache. Furthermore, on the



Figure 3.1: An example of a cache miss in a program with 2 instructions. The memory location 0x00 is evicted from the cache to make room for 0x10, only to be fetched again later.


software side, programmers are given numerous techniques to control the way data stays in the cache. In this report, we will be utilising two such techniques, cache allocation technology and cache colouring, which are used to partition the cache.

3.3 Summary

Cache pollution occurs when data is loaded into the cache unnecessarily. This causes potentially useful data to be evicted from the cache. When later accessing the potentially useful data, we have to go to lower levels of the memory hierarchy to fetch it, degrading the overall performance of the system.

This report attempts to find methods to reduce the severity of cache pollution bypartitioning the cache.


Chapter 4

Cache Allocation Technology

In this chapter, we will discuss how cache allocation technology can be used to mitigate cache misses with respect to garbage collectors; in particular, how it can be used to partition the cache between the mutator and the collector.

4.1 Implementation

Cache Allocation Technology (CAT) is a technology implemented for the last-level cache (LLC) on most modern Intel server processors, Skylake or later. It provides the ability to control the amount of cache space consumed by an underlying hardware process or POSIX thread.

As mentioned in Section 2.1.5, the cache can be divided into sets and ways. CAT associates particular ways with an intermediate group called a Class of Service (COS / CLOS). Each COS has a given resource capacity bitmask (CBM), which indicates the proportion of the cache given to that COS. Example CBMs can be seen in Fig. 4.1. Each COS is assigned a CBM, where a 1 indicates that the COS can use that section (way) of the cache and a 0 indicates that it cannot. Furthermore, these bitmasks can overlap with one another, as with COS 0 and COS 1, or they can be isolated from one another, as with COS 1 and COS 2.

Figure 4.1: Diagram of COS and CBMs for a cache with 12 ways in total.

Once the COS has been specified, threads or processes can be associated with each COS. Multiple threads or processes can be grouped under a single COS. By default, all threads and processes are assigned to COS 0.

CAT can be used to ensure that a particular process or thread cannot intrude on the domain of another. Hence, it provides isolation between the two, bringing security as well as potential performance benefits. In particular, it can address the problem of cache pollution, where a process intrudes on the cache space of a higher-priority one, evicting potentially useful cache lines and polluting the cache with useless ones.

4.2 The Mutator and The Collector

As mentioned in Section 2.2, garbage-collected systems are composed of two core components: the mutator and the collector.

The collector performs operations such as computing transitive closures, marking and copying, which often touch a significant number of objects but exhibit poor reuse. Hence, the collector is said to have poor temporal locality, polluting the cache with objects that are rarely reused. Consequently, the collector does not need the cache as much as the mutator, since only a small portion of its data is reused.

On the other hand, the mutator performs operations such as reading and writing objects. In other words, the mutator often needs to access and re-access objects in order to modify their state. Typically, these objects have good temporal locality, being reused frequently. Thus, the mutator depends on data persisting in the cache to reduce the occurrence of cache misses and thereby improve its performance.

We can think of the mutator as the important process and the collector as the polluting process. In the vanilla Jikes RVM, both the mutator and the collector are assigned the same cache. When the collector performs operations such as marking, copying and tracing, most of the cache contents are destroyed as a result of the large number of objects it accesses. However, the data it pulls into the cache is no longer reused and quickly becomes useless, while the data being evicted is data that the mutator might need, which is more valuable. Hence, we see cache pollution occurring.

We can mitigate the severity of cache pollution by assigning suitably sized proportions of the cache to the mutator and the collector. The smaller the proportion given to a process, the slower it will run, as it will incur more frequent cache misses. However, since the collector exhibits poor temporal locality, its proportion of the cache need not be large, and the mutator will also benefit as its portion of the cache is no longer being trashed. Hence, we need to balance the two proportions: reducing the proportion of the cache assigned to the collector degrades the collector's performance but improves the mutator's performance.


4.3 Summary

Cache allocation technology can be applied to assign the mutator and collector particular sections of the cache. In this circumstance, we use it to restrict the amount of cache assigned to the collector, to stop it from trashing the useful parts of the cache that the mutator needs.


Chapter 5

Cache Colouring

In this chapter, we will discuss how cache colouring works and how it can be used to mitigate cache misses, particularly for objects of different ages. Additionally, we will discuss cache colouring schemes for different garbage collectors, namely Generational Copy and Semispace.

5.1 Implementation

Cache colouring, also known as page colouring, is a technique that can be applied to most caches. Cache colouring operates on pages of particular colours: each page of a specific colour is mapped to particular sets of the cache and cannot interfere with pages of other colours.

An example of cache colouring can be seen in Fig. 5.1. Physical memory is composed of pages, and contiguous chunks of memory are striped in colour: each page is assigned a colour in order, from 0 to N, where N + 1 is the total number of colours for the cache. After exhausting the colours, the pattern wraps around. Each page can only be cached in the section of the cache corresponding to its colour.

Caches have varying numbers of colours, typically depending on how big the cache is as well as how many sets and ways there are. Typically, L1 caches are not coloured, whereas an L2 cache may have from 1 to 32 colours and an L3 cache can have more. Hence, cache colouring typically operates on the L2 and L3 caches.

Since each page is mapped to a colour and colours are mapped to locations in the cache, we gain isolation between the colours. A page with colour 0 will only ever occupy the locations corresponding to colour 0 in the cache. This is useful because it ensures that certain addresses do not interact with the addresses of another domain; hence, it provides isolation between them. Like CAT, it brings both potential security and performance benefits. It also helps in the scenario of cache pollution, where particular physical addresses intrude on the cache space of another.

The implementation of cache colouring is much trickier than that of cache allocation technology. Since we did not modify the kernel to directly support cache colouring, we needed some way to guarantee the colour of the pages we receive. The method that we used is memory inefficient and not as clean as directly allocating virtual pages of a particular colour, but it works on the standard Linux kernel.

Figure 5.1: An example of cache colouring. Each page in physical memory is mapped to a proportion of the cache of the same colour. The cache is composed of 4 colours, with 8 ways in total.

When allocating normal pages of virtual memory, the operating system can back these pages with any available piece of physical memory. These pages can be of any colour, as seen in Fig. 5.2(a): even though we allocate a contiguous strip of virtual memory, the colours are completely unknown.

On the other hand, when allocating using the huge page system, we allocate large chunks of virtual memory backed by contiguous physical memory aligned at a colour boundary. Hence, we get the situation shown in Fig. 5.2(b), where we know that the memory will be laid out in a striped manner. From there, we can divide the huge page into smaller pages of particular colours, which we can later hand out when performing particular allocations.

However, this method does have downsides. Firstly, there is additional computation on the user-level side to perform the alignment and division, whereas if we relied on the kernel this computation would be done on the kernel side. Secondly, we essentially waste memory when we have colours that are not needed for a specific space. Since each space has a notion of the start and end of its virtual address range, we cannot simply give these pages away. Furthermore, since we allocated them as huge pages, we must deallocate them as huge pages and hence cannot simply release the unneeded portion. Thus, we can use up to double the total amount of memory otherwise required.

Depending on the use case and the amount of memory available on the system, this may or may not be a viable solution.

Figure 5.2: Comparison of normal pages and huge pages. (a) Allocation of virtual memory using standard page allocation: each allocated page can be backed by any physical page, anywhere in physical memory and of any colour. (b) Allocation of virtual memory using huge page allocation: a huge page is backed by multiple contiguous physical pages.

5.2 Young and Mature Objects

As we know from Section 2.2.2, most objects tend to be young. Young objects are objects that do not survive one garbage collection cycle. There are many of these objects, and typically they are infrequently accessed [2]. On the other hand, mature objects tend to be accessed more frequently. Mature objects are objects that survive at least one garbage collection cycle, whether minor or major.

We want mature objects to persist in the cache, since we know they tend to be reused. We do not want a young object to evict a mature object from the cache, especially one that will be used frequently. Hence, we use cache colouring to isolate particular memory addresses from others, ensuring that mature objects and young objects are kept in separate portions of the cache.

We can change the proportion of the cache assigned to each group of objects by modifying the number of colours assigned. For an 8-colour cache, if we assign 1 colour to the young objects and 7 colours to the mature objects, then we have given young objects 1/8 of the cache and mature objects 7/8 of the cache.

5.2.1 Semispace

In semispace, the heap is divided into two equal semispaces, dubbed the 'tospace' and the 'fromspace'. New objects are allocated into the tospace. When the tospace is full, the two spaces flip in preparation for the collection, so the tospace is now dubbed the fromspace and vice versa. Live objects are identified and copied out into the new tospace. The objects in the fromspace are zeroed to destroy them for safety purposes. For a more detailed summary of the semispace collector's algorithm, refer to Section 2.2.4.

In a semispace collector, each space can contain both young and old objects. Allocation requests for young objects come from the mutator, whilst allocation requests for mature objects come from the collector when live objects are identified and copied out. Hence, we can use these allocation request pathways to determine the ages of the objects. From there, we can ensure that each object is placed on pages of the colours corresponding to its age.

5.2.2 Generational Copy

Generational copy is a generational version of semispace. The heap is divided into a nursery space and two equally sized mature semispaces. New objects are allocated into the nursery space. When the nursery space is full, a minor garbage collection happens, and the live objects in the nursery are copied into the mature 'tospace'. Allocation into the nursery space then continues until the next minor garbage collection occurs. When the tospace is full, a major garbage collection occurs: the spaces flip as in the Semispace collector, and the live objects are copied into the new tospace. For a more detailed summary of the generational copy collector's algorithm, refer to Section 2.2.4.

Unlike in the Semispace collector, each space has a distinct age associated with it: the nursery space only ever contains young objects, and the mature space only ever contains objects that have survived at least one garbage collection cycle. Consequently, we can simply assign each space on the heap a set of colours and only allocate pages of those colours to objects in that space.

5.3 Summary

Like cache allocation technology, cache colouring is used to partition the cache. We use cache colouring to partition the cache into two sections: one for young objects, which are newly allocated objects that have not yet survived a garbage collection cycle, and one for mature objects, which have persisted through at least one garbage collection cycle.


Chapter 6

Experimental Methodology

This chapter gives a short breakdown of the software and hardware used for the evaluation. Section 6.1 discusses the software platforms utilised and the reasons for choosing them. Section 6.2 discusses the choice of hardware platforms and breaks down the major components relevant to the evaluation.

6.1 Software platform

We have used Jikes RVM (Research Virtual Machine), building off commit hash 38095ccd for Cache Allocation Technology and commit hash e5be2a89 for cache colouring. These were the most recent commits when each technique was implemented.

For the cache allocation technology library, we utilised Intel's intel-cmt-cat version 2.0.0, as it supports association via POSIX thread and CPU process. Additionally, we use Linux version 4.15 as the operating system.

We use SemiSpace as the garbage collector for the comparison of way associations between the collector and the mutator, as it is a very simple collector. On the other hand, we chose SemiSpace and Generational Copy (GenCopy) as the garbage collectors for cache colouring, as both have some notion of young and mature objects. For more detailed information regarding these garbage collectors, refer to Section 2.2.4.

The DaCapo [1] benchmarks will be used. Only the 2006 and 2009 benchmarks will be used, since Jikes RVM cannot support the 2019 benchmarks as of this report.

For CAT, we utilise the 12 ways provided by the Broadwell machine. For cache colouring, we assume that there are 8 colours in total. We chose this number to reflect the minimum number of colours between the two machines; this ensures that we can apply colouring to the L2 cache on both architectures.

6.2 Hardware platform

Table 6.1 summarises the processors used in the evaluation. We chose the first model, Broadwell, as it supports Cache Allocation Technology. However, it does not support proper per-process performance statistics, as it uses the uncore CBox. Hence, we also decided to use the second model, Coffee Lake, as it is a more recent machine and supports performance statistics per process.

Table 6.1: Processors used in our evaluation.

Architecture        Broadwell      Coffee Lake
Model               Xeon D-1540    Core i9-9900K
Clock               2GHz           3.6GHz
Cores × SMT         8 × 2          8 × 2
L2 Cache            2MB            2MB
L2 Cache Colours    8              16
L3 Cache            12MB           16MB
L3 Cache Colours    128            128
L3 Cache Ways       12             16
Memory              16GB           32GB


Chapter 7

Results

This chapter will give an overview of the results obtained from the experiments, as well as analysis and discussion of these results. Section 7.1 discusses the results of using cache allocation technology, and Section 7.2 discusses the results of using cache colouring. For a summary of this chapter, refer to Section 7.3.

7.1 Cache Allocation Technology

In order to test the efficiency of Cache Allocation Technology, we associate a number of ways with the collector and the mutator. Since there are 12 ways in total on the Broadwell machine, we choose 12 collector ways as a baseline, denoted SS with 12 collector ways. This baseline is used to calculate the speed-up obtained by using Cache Allocation Technology, without factoring in overhead costs.

We utilise the vanilla Jikes RVM as another baseline, denoted SS. This result is contrasted with the first baseline, SS with 12 collector ways, in order to approximate the overhead of using the cache allocation technology system.

As we can see from Table 7.1, there is about a 0.6% overhead on average when using a heavily optimised cache allocation technology system, whereas we only see about a 0.1% increase in performance as a result of using cache allocation technology. Furthermore, even for benchmarks that heavily utilise the cache, such as luindex, we see at most a 1% increase in performance.

This improvement is fairly small, likely because CAT only operates on the L3 cache. In the case of the Broadwell processor, the L3 cache is 12MB, which is fairly large for the benchmarks in the DaCapo [1] suite. It would be interesting to gather results for benchmarks that use the L3 cache more heavily, such as the more recent 2019 DaCapo benchmarks, which are much larger. However, that would require either implementing the system on other virtual machines that support those benchmarks, such as OpenJDK, or waiting until Jikes RVM supports the 2019 benchmarks.

Table 7.1: Result of using Cache Allocation Technology. For details of the SemiSpace garbage collector, see Section 2.2.4.

Build                                  SS     SS with 12 collector ways    SS with 6 collector ways
% Time normalised to SemiSpace (SS)    100    100.60                       100.48

Table 7.2: Comparison of cache colouring using SemiSpace. For details of the SemiSpace garbage collector, see Section 2.2.4.

Build                                  SS     SS with 2 young and 6 old colours
% Time normalised to SemiSpace (SS)    100    98.97

As a result, it suffices to say that CAT does not seem promising for reducing memory latency in this case. Even with further optimisations and the like, it is unlikely to yield very good results.

7.2 Cache Colouring

For cache colouring, we split the cache between young and mature objects, assigning a fixed portion of the cache to young objects and the rest to mature objects. To be classified as mature, an object needs to persist through at least one minor garbage collection.

For the following set of results, the numbers of colours assigned to the young and old objects are the best combinations found.

7.2.1 Semispace

In Table 7.2, we see about a 1% improvement with the SemiSpace garbage collector, utilising 2 colours for young objects and 6 colours for mature objects. This means that 2/8 = 1/4 of the cache is given to the young objects and 6/8 = 3/4 of the cache is assigned to the mature objects.

As said in Section 7.2, the numbers chosen are the best overall; different proportions of the cache will yield different results.

As can be seen, this improvement is much higher than the result from Cache Allocation Technology in Section 7.1. The first and most obvious reason is likely that cache colouring targets both the L2 and the L3 cache, whereas CAT only targets the L3 cache. As mentioned before, the L3 cache is fairly large on both of the architectures that we are using (12MB and 16MB), while the L2 cache is significantly smaller at only 2MB. As discussed in Section 2.1.3, the smaller the cache, the more susceptible it is to cache misses. Hence, by targeting the L2 cache, we target a cache that suffers a greater number of cache misses, and reduce that number.

Another reason is that cache colouring and cache allocation technology target different aspects of the problem. In the case of cache allocation technology, we attempt to reduce the number of cache misses by reducing cache pollution when the system switches from collector to mutator. Cache colouring, on the other hand, attempts to reduce cache misses by reducing cache pollution during mutator runtime. The switch from collector to mutator occurs once per garbage collection cycle, which is insignificant in comparison to the number of mutations performed by the mutator. Hence, it is not surprising that we see a greater performance improvement using cache colouring.

Table 7.3: Comparison of cache colouring using GenCopy (Generational Copy). For details of the GenCopy garbage collector, see Section 2.2.4.

Build                                GC     GC with 2 young and 6 old colours
% Time normalised to GenCopy (GC)    100    99.70

7.2.2 Generational Copy

In Table 7.3, we can see about a 0.3% improvement to the overall system. Like SemiSpace, this uses 2 colours for the young objects and 6 colours for the mature objects; thus, 1/4 of the cache is given to young objects and 3/4 of the cache is given to mature objects.

As said in Section 7.2, the proportions of the cache are assigned based on the best results; different proportions will yield different results.

We surmise that the reduced improvement may be due to the difference in heap space assigned to young and old objects. In GenCopy, only 15% of the heap is assigned to the nursery space and hence to young objects. With the SemiSpace collector, on the other hand, this percentage varies, as there is no fixed space for young and mature objects: it can range from 0%, when a garbage collection occurs and no objects are collected, to 100%, when a garbage collection collects every object in the heap.

As a result, even though we classify mature objects as those surviving at least one minor garbage collection, it may be that mature objects in the case of GenCopy are not mature enough. Even though they have survived one minor garbage collection, their lifetime may end shortly after that collection. Therefore, the objects that we have classified as mature may not actually satisfy the assumptions we make about mature objects. Hence, like before, they pollute the cache, and we are not able to partition them as successfully as in the SemiSpace collector.

To verify this assumption, we could look at altering the nursery size. This change in size would mean that objects classified as mature would have survived through greater amounts of allocation.

Alternatively, we could verify this assumption by changing the collector. Although this generational collector has only 2 generations, other collectors have multiple generations, which would give us more fine-grained control over the characteristics of each generation. However, due to time constraints, this was unfortunately not verified.

7.3 Summary

The results for cache allocation technology are fairly poor: the overhead is fairly large and the performance benefit is minimal. Hence, cache allocation technology is likely not very useful in mitigating cache misses, as it can only operate on the L3 cache, which is not enough to yield any perceptible performance increase.

On the other hand, the results for cache colouring are decent, with a visible performance benefit. Furthermore, it is likely that with hyperparameter tuning and further optimisation, we would see further gains.


Chapter 8

Conclusion

As hardware continues to get faster, memory latency becomes a bigger and bigger bottleneck. Hence, clever optimisations and techniques to mitigate cache misses, and thus improve cache performance, are integral. This project investigated potential cache optimisations with respect to garbage collectors, utilising two different methods. We showed that cache allocation technology does not seem to be a viable optimisation, as the overhead was too great in comparison to the meagre performance gain. However, we also demonstrated that cache colouring is promising, achieving an average performance boost of 1% in the case of SemiSpace and 0.3% in the case of GenCopy. We note areas in which to extend this work in Section 8.1.

8.1 Future Work

8.1.1 Cache Allocation Technology

Although the results for cache allocation technology (CAT) are not promising, some further investigations could be done. As mentioned before, the main fault of CAT lies in it only being able to operate on the last-level cache (LLC). The DaCapo 2006 and 2009 benchmarks are too small for this to have any impact. However, as mentioned in Section 7.1, the DaCapo 2019 benchmarks are more recent and larger; hence, it may be possible to see some performance improvement there.

8.1.2 Cache Colouring

Although we have shown that cache colouring has the potential to mitigate cache misses, we can further improve on this by exploring different parameters, as noted in Section 7.3. These parameters could include changing the garbage collector, such as using G1 or Generational Immix, or varying the size of different spaces, such as altering the nursery space in the case of Generational Copy, as mentioned in Section 7.2.

Furthermore, we could also investigate how cache colouring performs on different processors, in particular machines with different L2 and L3 sizes from those tested. This would shed more light on how the L2 and L3 caches interact with and benefit from cache colouring.


Additionally, we have only implemented cache colouring on Jikes RVM, which is a research virtual machine. To test its commercial capabilities, it would be interesting to implement cache colouring for commercial virtual machines such as OpenJDK and see whether the results are similar to those for Jikes RVM.

As mentioned before, our cache colouring system is not optimal. In Section 5.1, we mention that we could also modify the kernel to directly obtain pages of particular colours, reducing both memory costs and potential performance costs. For users who do not mind modifying the kernel, this may be a way to reduce memory and time costs.


Appendix A

The following documents contain: a copy of the study contract, project summaries and a description of software/artefacts.

For information regarding experiments, please see Chapter 6. For information regarding which files have been edited, licenses, etc., please see

https://gitlab.anu.edu.au/brendawang/comp3770/comp3770

A.1 Project Structure

This is the main repository for COMP3770. The project is broken down into 2 smaller projects: cat for cache allocation technology and cc for cache colouring. The common folder is for projects and scripts common to both, such as running, gen-advice, etc. The admin folder holds information regarding the meetings and study contract, as well as slides. The report folder contains the report.

Inside cc is jikesrvm-cc, a clone of JikesRVM with cache colouring added to it, as well as scripts, which contains scripts to help run the code.

Inside cat is jikesrvm-cat, a clone of JikesRVM with Cache Allocation Technology added to it; kernel_mod, an implementation of the Cache Allocation Technology kernel module; and scripts, which contains CAT-specific scripts to help run the code.

8.1.4 Testing

Building JikesRVM

To build JikesRVM, run:

cd <jikesrvm_directory>
bin/buildit <host, e.g. localhost or vole.moma> -j <java_home> <jikesrvm build>

Benchmark

The benchmarks used for the project can be found at http://dacapobench.sourceforge.net/. To run a particular benchmark, run:


Research School of Computer Science Form updated Jun 2018

INDEPENDENT STUDY CONTRACT

PROJECTS Note: Enrolment is subject to approval by the course convenor

SECTION A (Students and Supervisors)

UniID: u6374399 .

SURNAME: Wang FIRST NAMES: Brenda .

PROJECT SUPERVISOR (may be external): Yuval Yarom .

FORMAL SUPERVISOR (if different, must be an RSSCS academic): Stephen M. Blackburn .

COURSE CODE, TITLE AND UNITS: COMP3770 .

COMMENCING SEMESTER S1 S2 YEAR: 2019 . Two-semester project (12u courses only):

PROJECT TITLE:

Optimising Garbage Collection using modern hardware mechanisms

LEARNING OBJECTIVES:

Learn about and utilise modern hardware mechanisms and components

Work with and understand the intricacies of garbage collectors

Experience working in and modifying a research virtual machine

PROJECT DESCRIPTION:

This project will involve improving the performance of garbage collectors by using modern x86 hardware mechanisms.

Cache Allocation Technology (CAT) is used to split up the cache into multiple ways that can be assigned to multiple hyper-threads. This can be used to stop threads from polluting the whole cache and slowing down the performance of high-priority processes such as the mutator. This may be useful for both zeroing and collecting, since both have a high cache footprint: each heavily accesses memory but has little reuse.

Cache Colouring splits up the cache into multiple sets which are spatially assigned. Objects that are infrequently accessed can be assigned to regions corresponding to a small portion of the cache, whilst those that are frequently accessed can be assigned to regions associated with a large section of the cache. This technique may be beneficial for generational collectors, since younger objects tend to have a higher presence in cache but may only be accessed a few times. Thus, a prenursery may be useful to weed out young, infrequently accessed objects.

Hardware Transactional Memory (HTM), such as Intel Transactional Synchronization Extensions (TSX), can be used as a roll-back technique. This can be used as a replacement for traditional yieldpoint techniques such as conditional or trap-based approaches.

Memory Protection Keys (MPK) are a way of assigning multiple pages to a particular key. For collectors that work on memory-protected regions, such as C4 or The Compressor, this may reduce the expensive transaction cost of individually reassigning protections for each page. Furthermore, this may be useful for thread-local collectors in identifying non-private objects.

Figure 8.1: A copy of the Independent Study Contract


Research School of Computer Science — Form updated Jun 2018

ASSESSMENT (as per the project course's rules web page, with any differences noted below):

Assessed project components / % of mark / Due date / Evaluated by
Report style: Paper (e.g. research report, software description) / examiner
Artefact kind: Software (e.g. software, user interface, robot) / supervisor
Presentation / course convenor

MEETING DATES (IF KNOWN): Weekly

STUDENT DECLARATION: I agree to fulfil the above defined contract. (Signature, Date)

SECTION B (Supervisor): I am willing to supervise and support this project. I have checked the student's academic record and believe this student can complete the project. I nominate the following examiner, and have obtained their consent to review the report (via signature below or attached email). (Signature, Date)

Examiner Name, Signature. (Nominated examiners may be subject to change on request by the supervisor or course convenor.)

REQUIRED DEPARTMENT RESOURCES

SECTION C (Course convenor approval): (Signature, Date)

Figure 8.2: A copy of the Independent Study Contract


cd <jikesrvm_directory>
dist/<build>/rvm -jar dacapo-2006-10-MR2.jar <benchmark name>

8.1.5 Project Summaries

Cache Allocation Technology (CAT)

Cache Allocation Technology is implemented on most modern Intel machines of the Skylake generation or later. The cache is divided into sets and ways. CAT associates particular ways with a particular Class of Service (COS), so a given COS can only use a particular set of ways. Each COS is then associated with an underlying hardware process, or even a POSIX thread.

This is useful because it ensures that one process or thread cannot intrude into the cache domain of another. Hence, it provides isolation between the two, bringing security as well as potential performance benefits. Cache pollution occurs when one process intrudes on the cache space of another, evicting potentially useful cache lines and filling the cache with useless ones.

Since CAT is able to isolate threads, we can use it to isolate the mutator and the collector from each other, placing them on separate ways. The collector performs operations such as transitive closure, marking, and copying, which often touch a significant number of objects but rarely reuse them. The mutator, on the other hand, typically accesses objects again and has a set of heavily used objects. Hence, we can think of the mutator as the primary process and the collector as the polluting process or 'noisy neighbour'.
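As an illustration (not part of the project's artefacts), the way masks that CAT consumes can be computed directly. A COS is configured with a capacity bitmask (CBM), a contiguous run of set bits selecting the ways that COS may occupy. A minimal sketch, with an assumed 11-way last-level cache and a hypothetical mutator/collector split:

```python
def cat_cbm(first_way: int, num_ways: int) -> str:
    """Build a CAT capacity bitmask (CBM), as a hex string, covering
    `num_ways` contiguous cache ways starting at `first_way`.
    CAT requires the set bits in a CBM to be contiguous, so a run
    of ways is the only legal shape."""
    if num_ways < 1:
        raise ValueError("a COS must own at least one way")
    mask = ((1 << num_ways) - 1) << first_way
    return format(mask, "x")

# Hypothetical split of an 11-way LLC: the mutator keeps ways 2-10,
# while the collector is confined to ways 0-1 so it cannot pollute
# the mutator's portion of the cache.
mutator_cbm = cat_cbm(2, 9)    # "7fc"
collector_cbm = cat_cbm(0, 2)  # "3"
```

On a Linux kernel with resctrl support, such masks would be written into a resource group's schemata file (e.g. as `L3:0=7fc`); the actual way count, and whether masks may overlap, depend on the processor.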

Cache Colouring (CC)

Cache Colouring is a technique applicable to most caches, both AMD and Intel. The cache is divided into sets and ways. Cache colouring operates on the sets: particular physical addresses map to particular sets, so a particular colour maps to particular sets and cannot interact with other colours. Each colour corresponds to particular physical addresses, and often each page has a different colour; a contiguous chunk of physical memory is striped in colour.

This is useful because it ensures that addresses in one domain do not interact with the addresses of another, providing isolation between them. Like CAT, it brings both potential security and performance benefits, and it likewise guards against cache pollution, where particular physical addresses intrude on the cache space of others.

Since CC can ensure that particular memory addresses are isolated from others in the cache, we can use it to keep the addresses associated with particular objects separate from other objects. One situation where this may be useful is the separation of mature and nursery objects. Young objects live in the nursery space and are typically accessed infrequently before dying; older objects live in the mature space and, on average, are accessed frequently in contrast to young objects.
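To make the striping concrete: assuming a physically indexed cache and 4 KiB pages, the number of page colours is the bytes per way divided by the page size, and a page's colour is its physical frame number modulo that count. A sketch under these assumptions (the 1 MiB, 16-way cache below is illustrative, not one of the machines tested):

```python
PAGE_SIZE = 4096  # 4 KiB pages, the common x86-64 base page size

def num_colours(cache_bytes: int, ways: int, page_size: int = PAGE_SIZE) -> int:
    """Number of page colours = bytes per way / page size
    (equivalently, sets * line size / page size)."""
    return (cache_bytes // ways) // page_size

def page_colour(phys_addr: int, colours: int, page_size: int = PAGE_SIZE) -> int:
    """Colour of the page containing phys_addr: frame number mod #colours."""
    return (phys_addr // page_size) % colours

# A contiguous chunk of physical memory is striped in colour:
colours = num_colours(1 << 20, 16)  # 1 MiB, 16-way cache -> 16 colours
stripe = [page_colour(i * PAGE_SIZE, colours) for i in range(18)]
# stripe cycles through 0, 1, ..., 15, then wraps back to 0, 1
```

Two pages of the same colour compete for the same cache sets, while pages of different colours can never evict each other, which is exactly the isolation the nursery/mature separation relies on.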


Bibliography

[1] Blackburn, S. M.; Garner, R.; Hoffmann, C.; Khang, A. M.; McKinley, K. S.; Bentzur, R.; Diwan, A.; Feinberg, D.; Frampton, D.; Guyer, S. Z.; Hirzel, M.; Hosking, A.; Jump, M.; Lee, H.; Moss, J. E. B.; Phansalkar, A.; Stefanovic, D.; VanDrunen, T.; von Dincklage, D.; and Wiedermann, B., 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06 (Portland, Oregon, USA, 2006), 169–190. ACM, New York, NY, USA. doi:10.1145/1167473.1167488. http://doi.acm.org/10.1145/1167473.1167488. (cited on pages 7, 23, and 25)

[2] Blackburn, S. M. and McKinley, K. S., 2007. Transient caches and object streams. Cluster Computing, (2007). (cited on pages 1, 2, 7, and 21)

[3] Boehm, H.-J., 2000. Reducing garbage collector cache misses. SIGPLAN Not., 36,1 (Oct. 2000), 59–64. doi:10.1145/362426.362438. http://doi.acm.org/10.1145/362426.362438. (cited on page 2)

[4] Cheney, C. J., 1970. A nonrecursive list compacting algorithm. Commun. ACM,13, 11 (Nov. 1970), 677–678. doi:10.1145/362790.362798. http://doi.acm.org/10.1145/362790.362798. (cited on page 7)

[5] Dijkstra, E. W.; Lamport, L.; Martin, A. J.; Scholten, C. S.; and Steffens, E.F. M., 1978. On-the-fly garbage collection: An exercise in cooperation. Commun.ACM, 21, 11 (Nov. 1978), 966–975. doi:10.1145/359642.359655. http://doi.acm.org/10.1145/359642.359655. (cited on page 6)

[6] Fenichel, R. R. and Yochelson, J. C., 1969. A LISP garbage-collector for virtual-memory computer systems. Commun. ACM, 12, 11 (Nov. 1969), 611–612. doi:10.1145/363269.363280. http://doi.acm.org/10.1145/363269.363280. (cited on page 7)

[7] Hill, M. D. and Smith, A. J., 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput., 38, 12 (Dec. 1989), 1612–1630. doi:10.1109/12.40842. https://doi-org.virtual.anu.edu.au/10.1109/12.40842. (cited on pages 1, 4, and 11)

[8] Hill, M. D. and Smith, A. J., 1989. Evaluating associativity in CPU caches. IEEE Trans. Computers, 38 (1989), 1612–1630. (cited on pages 2, 4, and 11)

[9] Johnson, T. L.; Connors, D. A.; Merten, M. C.; and Hwu, W.-M., 1999. Run-time cache bypassing. IEEE Transactions on Computers, 48, 12 (1999), 1338–1354. (cited on pages 2, 4, and 11)


[10] Jones, R.; Hosking, A.; and Moss, E., 2011. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman & Hall/CRC, 1st edn. ISBN 1420082795, 9781420082791. (cited on page 6)

[11] Jouppi, N., 1990. Reducing compulsory and capacity misses. (01 1990). (cited on page 11)

[12] Lieberman, H. and Hewitt, C., 1983. A real-time garbage collector based on the lifetimes of objects. Commun. ACM, 26, 6 (Jun. 1983), 419–429. doi:10.1145/358141.358147. http://doi.acm.org/10.1145/358141.358147. (cited on page 9)

[13] Patterson, R. H., 1997. Informed prefetching and caching. Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science. (cited on page 4)

[14] Reddy, V. K.; Sawyer, R. K.; and Gehringer, E. F., 2006. A cache-pinning strategy for improving generational garbage collection. In High Performance Computing - HiPC 2006, 98–110. Springer Berlin Heidelberg, Berlin, Heidelberg. (cited on pages 2, 4, and 11)

[15] Ungar, D., 1984. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. In Proceedings of the First ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, SDE 1, 157–167. ACM, New York, NY, USA. doi:10.1145/800020.808261. http://doi.acm.org/10.1145/800020.808261. (cited on page 9)

[16] Wulf, W. A. and McKee, S. A., 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23, 1 (1995), 20–24. (cited on page 11)

[17] Yang, X.; Blackburn, S. M.; Frampton, D.; Sartor, J.; and McKinley, K. S., 2011. Why nothing matters: The impact of zeroing. ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). https://www.microsoft.com/en-us/research/publication/why-nothing-matters-the-impact-of-zeroing/. (cited on pages 2, 4, and 11)

[18] Yarom, Y.; Ge, Q.; Liu, F.; Lee, R. B.; and Heiser, G., 2015. Mapping the Intel last-level cache. IACR Cryptology ePrint Archive, 2015 (2015), 905. (cited on pages 2, 4, and 11)

[19] Zorn, B., 1991. The effect of garbage collection on cache performance. Technical report. (cited on page 1)
