CloudCache: Expanding and Shrinking Private Caches

Sangyeun Cho
Computer Science Department, University of Pittsburgh
Credits
Parts of the work presented in this talk come from results obtained in collaboration with students and faculty at the University of Pittsburgh:
• Mohammad Hammoud
• Lei Jin
• Hyunjin Lee
• Kiyeon Lee
• Prof. Bruce Childers
• Prof. Rami Melhem
Partial support has come from:
• NSF grant CCF-0702236
• NSF grant CCF-0952273
• A. Richard Newton Graduate Scholarship, ACM DAC 2008
Recent multicore design trends

Modular designs based on relatively simple cores
• Easier to validate (a single core)
• Easier to scale (the same validated design replicated multiple times)
• For these reasons, future “many-core” processors may well look like this
• Examples: Tilera TILE*, Intel 48-core chip [Howard et al., ISSCC ’10]
Design focus shifts from cores to “uncores”
• Network on-chip (NoC): bus, crossbar, ring, 2D/3D mesh, …
• Memory hierarchy: caches, coherence mechanisms, memory controllers, …
• System-on-chip components like network MAC, cipher, …

Note also:
• In this design, aggregate cache capacity increases as we add more tiles
• Some tiles may be active while others are inactive
L2 cache design issues
L2 caches occupy a major portion of the chip area (probably the largest)

Capacity vs. latency
• Off-chip access (still) expensive
• On-chip latency can grow; with switched on-chip networks, we have NUCA (Non-Uniform Cache Architecture)
• Private cache vs. shared cache, or a hybrid scheme?

Quality of service
• Interference (capacity/bandwidth)
• Explicit/implicit capacity partitioning

Manageability, faults, and yield
• Not all caches have to be active
• Hard (irreversible) faults + process variation + soft faults
Private cache vs. shared cache

Private cache:
• Short hit latency (always local)
• High on-chip miss rate
• Long miss resolution time (i.e., are there replicas on chip?)

Shared cache:
• Low on-chip miss rate
• Straightforward data location
• Long average hit latency
• Uncontrolled interference
Historical perspective

Private cache designs came first
• Multiprocessors built from discrete microprocessors

Shared cache clustering [Nayfeh and Olukotun, ISCA ‘94]

Early multicores found the private design more straightforward
• Earlier AMD chips and Sun Microsystems’ UltraSPARC processors

Intel, IBM, and Sun pushed for shared caches (e.g., Xeon products, POWER-4/5/6, Niagara*)
• A variety of workloads perform relatively well

More recently, private caches regain popularity
• We have more transistors
• Interference is easily controlled
• Matches well with on-/off-chip (shared) L3 $$ [Oh et al., ISVLSI ‘09]
Shared $$ issue 1: NUCA

A compromise design
• Large monolithic shared cache: conceptually simple but very slow
• Smaller cache slices w/ disparate latencies [Kim et al., ASPLOS ‘02]
• Interleave data across all available slices (e.g., IBM POWER-*)

Problems
• Blind data distribution with no provisions for latency hiding
• As a result, simple NUCA performance is not scalable

[Figure: “your program” on one tile, “your data” scattered across all cache slices]
Improving program-data distance
Key idea: replicate or migrate
Victim replication [Zhang and Asanović, ISCA ’05]
• Victims from private L1 cache are “replicated” to local L2 bank
• Essentially, local L2 banks incorporate large victim caching space
• A natural hybrid scheme that takes advantages of shared & private $$
Adaptive Selective Replication [Beckmann et al., MICRO ‘06]
• Uncontrolled replication can be detrimental (shared cache degenerates to private cache)
• Based on benefit and cost, control replication level
• Hardware support is quite complex
Adaptive Controlled Migration [Hammoud, Cho, and Melhem, HiPEAC ‘09]
• Migrate (rather than replicate) a block to minimize the average access latency to this block, based on Manhattan distance (see the sketch below)
• Caveat: book-keeping overheads
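To make the distance idea concrete, here is a rough sketch (not the authors’ exact ACM algorithm) that picks a migration target by minimizing the access-weighted Manhattan distance to a block’s requesters; the 8×8 grid, the Accessor record, and all names are assumptions:

```c
#include <stdlib.h>

/* Hypothetical sketch: pick a migration target tile that minimizes the
 * access-weighted Manhattan distance to a block's requesters. */
#define DIM 8

typedef struct { int x, y, accesses; } Accessor;

static int manhattan(int x0, int y0, int x1, int y1) {
    return abs(x0 - x1) + abs(y0 - y1);
}

/* Returns the tile index (y * DIM + x) with the lowest total cost. */
int pick_migration_target(const Accessor *acc, int n) {
    int best_tile = 0;
    long best_cost = -1;
    for (int y = 0; y < DIM; y++) {
        for (int x = 0; x < DIM; x++) {
            long cost = 0;
            for (int i = 0; i < n; i++)
                cost += (long)acc[i].accesses
                      * manhattan(x, y, acc[i].x, acc[i].y);
            if (best_cost < 0 || cost < best_cost) {
                best_cost = cost;
                best_tile = y * DIM + x;
            }
        }
    }
    return best_tile;
}
```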
3-way communication

Once you migrate a cache block away from its home tile, how do you locate it? [Hammoud, Cho, and Melhem, HiPEAC ‘09]

[Figure: a requester (1) queries B’s home tile, which (2) forwards the request to B’s new host, which (3) returns the data: three network traversals]
Cache block location strategies

Always-go-home
• “Go to the home tile (directory) to find the tracking information”
• Lots of 3-way cache-to-cache transfers

Broadcasting
• “Don’t remember anything”
• Expensive

Caching of tracking information at potential requesters
• “Remember (locally) where previously accessed blocks are”
• Quickly locate the target block locally (after first miss)
• Need maintenance (coherence) of distributed information
• We proposed and studied one such technique [Hammoud, Cho, and Melhem, HiPEAC ‘09]
Caching of tracking information

[Figure: the home tile of B keeps primary tracking info alongside its L2 directory; a requester tile for B keeps a replicated copy (tag + pointer) in its tracking information table (TrT)]

Primary tracking information at home tile
• Shows where the block has been migrated to
• Keeps track of who has copied the tracking information (bit vector)

Tracking information table (TrT)
• A copy of the tracking information (only one pointer)
• Updated as necessary (initiated by the directory of the home tile); a possible layout is sketched below
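The two structures above might be laid out as follows; this is an illustrative guess, with field names and widths assumed rather than taken from the paper:

```c
#include <stdint.h>

/* Primary tracking info, kept at the block's home-tile directory. */
typedef struct {
    uint64_t tag;          /* block address tag */
    uint8_t  host_tile;    /* tile the block has migrated to */
    uint64_t copiers;      /* bit vector: tiles caching this tracking info */
} primary_track_t;

/* Tracking information table (TrT) entry at a requester tile:
 * only one pointer, updated when the home directory initiates it. */
typedef struct {
    uint64_t tag;          /* block address tag */
    uint8_t  host_tile;    /* last-known host of the block */
} trt_entry_t;
```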
A flexible data mapping approach
Key idea: What if we distribute memory pages to cache banks instead of memory blocks? [Cho and Jin, MICRO ‘06]
[Figure: fine-grained mapping of memory blocks vs. page-granularity mapping of memory pages to cache banks]
Observation

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 to different cache banks]

Software maps data to different $$
Key: OS page allocation policies → flexible (see the sketch below)
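A minimal sketch of such an OS policy is page coloring, where the physical page number chosen by the allocator determines the cache bank; the 64-bank layout, 4KB pages, and helper names below are assumptions, not the exact policy of [Cho and Jin, MICRO ‘06]:

```c
#include <stdint.h>

#define NUM_BANKS 64          /* assumed 8x8 tiled CMP */
#define PAGE_SHIFT 12         /* 4 KB pages */

/* A page's "color" (its cache bin) follows from its physical address. */
static inline unsigned page_color(uint64_t paddr) {
    return (unsigned)((paddr >> PAGE_SHIFT) % NUM_BANKS);
}

/* Hypothetical allocator hook: to place a page near tile `home`,
 * the OS picks a free physical page whose color equals `home`. */
uint64_t alloc_page_for_tile(unsigned home,
                             const uint64_t *free_pages, int n) {
    for (int i = 0; i < n; i++)
        if (page_color(free_pages[i] << PAGE_SHIFT) == home)
            return free_pages[i];
    return free_pages[0];     /* fall back to any free page */
}
```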
Shared $$ issue 2: interference
Co-scheduled programs compete for cache capacity freely (“capitalism”)
A well-behaving program gets damaged if another program vies for more cache capacity with little reuse (e.g., streaming)

Performance loss due to interference is hard to predict
[Figure: measured on an 8-core Niagara; interference causes up to a 2× impact]
Explicit/implicit partitioning

Strict partitioning based on system directives
• E.g., program A gets 768KB and program B 256KB
• System must allow a user to specify partition sizes [Iyer, ICS ‘04][Guo et al., MICRO ‘07]

Architectural/system mechanisms
• Way partitioning (e.g., 5 vs. 3 ways) [Suh et al., ICS ‘01][Kim et al., PACT ‘04]
• Bin/bank partitioning: maps pages to cache bins (no hardware support needed) [Liu et al., HPCA ‘04][Cho, Jin and Lee, RTCSA ‘07][Lin et al., HPCA ‘08]

[Figure: ways of a set-associative cache split between program A and program B]

Implicit partitioning based on utility (“utilitarian”)
• Way partitioning based on marginal gains [Suh et al., ICS ‘01][Qureshi et al., MICRO ‘06]
• Bank level [Lee, Cho and Childers, HPCA ’10]
• Bank + way level [Lee, Cho and Childers, HPCA ’11]
Explicit partitioning / Implicit partitioning

[Figures: “Explicit partitioning” and “Implicit partitioning,” each plotted as a function of cache capacity]
Private $$ issue: capacity borrowing
The main drawback of private caches is their fixed (small) capacity

Assume an underlying mechanism enforces the correctness of cache access semantics; key idea: borrow (or steal) capacity from others
• How much capacity do I need and opt to borrow?
• Who may be a potential capacity donor?
• Do we allow performance degradation of a capacity donor?
Cooperative Caching (CC) [Chang and Sohi, ISCA ‘06]
• Stealing coordinated by a centralized directory (not scalable)
Dynamic Spill Receive (DSR) [Qureshi, HPCA ‘09]
• Each node is a spiller or a receiver
• On eviction from a spiller, randomly choose a receiver to copy the data into; then, for data access later, use broadcasting
StimulusCache [Lee, Cho, and Childers, HPCA ‘10]
• It’s likely we’ll have “excess cache” in the future
• Intelligently share the capacity provided by excess caches based on utility
Core disabling uneconomical

Many unused (excess) L2 caches exist

Problem exacerbated with many cores

[Charts: probability of having N sound cores/L2 caches for an 8-core chip (N = 1–8) and a 32-core chip (N = 16–32), comparing the L2 cache alone, the processing logic alone, and the whole core (L2 + proc. logic)]
Dynamic sharing policy

Flow-in#N: # of data blocks flowing into EC#N
Hit#N: # of data blocks hit at EC#N

Hit/Flow-in ↑ → more ECs; Hit/Flow-in ↓ → fewer ECs

[Figure: Cores 0–3 with private L2s share excess caches EC#1 and EC#2 on the path to main memory; per-EC Hits and Flow-in counters drive the decision]
Dynamic sharing policy (cont’d)

• Core 0: needs at least 1 EC; no harmful effect on EC#2 → allocate 2 ECs
• Core 1: needs at least 2 ECs → allocate 2 ECs
• Core 2: needs at least 1 EC; harmful effect on EC#2 → allocate 1 EC
• Core 3: no benefit with ECs → allocate 0 ECs

Maximized capacity utilization, minimized capacity interference (see the sketch below)

[Figure: final EC allocation per core: 2, 2, 1, 0]
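Coded up, this rule might look like the sketch below; the threshold test and the interference check are assumptions standing in for StimulusCache’s actual policy:

```c
/* Hedged sketch of StimulusCache-style dynamic sharing: a core gets the
 * ECs it demonstrably needs, and extends further only when harmless.
 * THRESHOLD and the harm test are assumed, not the published policy. */
#define MAX_EC 2
#define THRESHOLD 0.05   /* minimum useful hits per flow-in (assumed) */

typedef struct { long hits, flow_in; } ec_stats_t;

int ecs_to_allocate(const ec_stats_t *ec, const int harms[MAX_EC]) {
    /* No benefit at the first EC level: allocate no excess cache. */
    double r0 = ec[0].flow_in ? (double)ec[0].hits / ec[0].flow_in : 0.0;
    if (r0 < THRESHOLD) return 0;
    int grant = 1;
    for (int level = 1; level < MAX_EC; level++) {
        double r = ec[level].flow_in
                 ? (double)ec[level].hits / ec[level].flow_in : 0.0;
        if (r >= THRESHOLD || !harms[level])
            grant = level + 1;   /* needed, or harmless to extend */
        else
            break;               /* neither needed nor harmless */
    }
    return grant;
}
```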
Summary: priv. vs. shared caching
For a small- to medium-scale multicore processor, shared caching appears good (for its simplicity)
• A variety of workloads run well
• Care must be taken about interference
• Because of NUCA effects, this approach is not scalable!

For larger-scale multicore processors, private cache schemes appear more promising
• Free performance isolation
• Less NUCA effect
• Easier fault isolation
• Capacity stealing a necessity
Summary: hypothetical caching
(256kB L2 cache slice)
Rel
ativ
e pe
rfor
man
ceto
sha
red
cach
ing
(w/o
pro
filin
g)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
gzip vpr gcc mcf
crafty
parse r eo
ngap
vorte
xbzip
2tw
olf
wupwise swim
mgrid mesa art
equake
ammp
g-mea
n
private w/ profilingshared w/ profilingstatic 2D coloring
Tackle both miss rate and locality through judicious data mapping! [Jin and Cho, ICPP ‘08]
Summary: hypothetical caching

[Chart: # of pages mapped per 256kB L2 cache bank across the on-chip X-Y grid for GCC, relative to the program location]
Summary: hypothetical caching

[Chart: alpha values (0 = private caching … 1 = shared caching, in steps of 0.125) per benchmark (ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise) with 256kB L2 cache banks]
In the remainder of this talk
CloudCache
• Private cache based scheme for large-scale multicore processors
• Capacity borrowing at way/bank level & distance awareness
A 30,000-foot snapshot

[Figure: a tiled CMP; each working core (C) owns a numbered “cache cloud” of nearby banks, with a high hit rate & fast access close by and a low hit rate & slow access farther away]

Each cloud is an exclusive private cache for a working core

Clouds are built with nearby cache ways organized in a chain

Clouds are built in a two-step process: dynamic global partitioning and cache cloud formation
Step 1: Dynamic global partitioning
Global capacity allocator (GCA) runs a partitioning algorithm periodically using “hit count” information
• Each working core sends the GCA hit counts from its “allocated capacity” and from additional “monitoring capacity” (of 32 ways)

Our partitioning considers both utility [Qureshi & Patt, MICRO ‘06] and QoS

This step computes how much capacity to give to each core
[Figure: working cores send hit counts over the network to the GCA; a counter buffer feeds the allocation engine, which returns cache allocation info]
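A utility-driven partitioner in the spirit of [Qureshi & Patt, MICRO ‘06] can be sketched as a greedy marginal-gain loop over the per-way hit counts; this simplification omits lookahead and the QoS constraint, and all names are assumed:

```c
/* Greedy utility-partitioning sketch: repeatedly give the next cache way
 * to the core whose hit counters promise the largest marginal gain.
 * hits[c][w] = sampled hits in core c's (w+1)-th most-recent way. */
#define NCORES   64
#define MAX_WAYS 32   /* allocated + monitoring capacity per core */

void partition(long hits[NCORES][MAX_WAYS], int total_ways,
               int alloc[NCORES]) {
    for (int c = 0; c < NCORES; c++) alloc[c] = 0;
    for (int w = 0; w < total_ways; w++) {
        int best = -1; long best_gain = -1;
        for (int c = 0; c < NCORES; c++) {
            if (alloc[c] >= MAX_WAYS) continue;
            long gain = hits[c][alloc[c]]; /* hits the next way would add */
            if (gain > best_gain) { best_gain = gain; best = c; }
        }
        if (best < 0) break;               /* every core fully allocated */
        alloc[best]++;
    }
}
```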
Step 2: Cache cloud formation

1. Local L2 cache first
2. Threads w/ larger capacity demand first
3. Closer L2 banks first
4. Allocate as much capacity as possible

Allocation is done only for sound L2 banks; repeat until every goal is met (see the sketch below)

[Figure: worked example; each thread’s “Goal / Capacity to allocate” (2.75, 1.75, 1.25, 0.75, 0.25, …) is drained to 0 as nearby banks are claimed. Repeat!]
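The four rules read as the following greedy loop; this is a sketch under assumed data structures, with fractional ways modeling partial-way grants:

```c
/* Sketch of cache-cloud formation following rules 1-4 above. */
#define NBANKS 64

typedef struct {
    double goal;          /* capacity still to allocate (ways, fractional) */
    int order[NBANKS];    /* sound-bank IDs sorted by distance; [0] = local */
} thread_t;

void form_clouds(thread_t *t, int nthreads, double bank_free[NBANKS]) {
    for (;;) {                                 /* Repeat! */
        thread_t *pick = 0;                    /* rule 2: largest demand */
        for (int i = 0; i < nthreads; i++)
            if (t[i].goal > 1e-9 && (!pick || t[i].goal > pick->goal))
                pick = &t[i];
        if (!pick) break;                      /* all demands satisfied */
        for (int k = 0; k < NBANKS && pick->goal > 1e-9; k++) {
            int b = pick->order[k];            /* rules 1 & 3: nearest first */
            if (bank_free[b] <= 0) continue;
            double grab = pick->goal < bank_free[b]     /* rule 4: take as */
                        ? pick->goal : bank_free[b];    /* much as possible */
            bank_free[b] -= grab;
            pick->goal   -= grab;
        }
        if (pick->goal > 1e-9) pick->goal = 0; /* no sound capacity left */
    }
}
```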
Cache cloud chaining example

[Figure: on a 3×3 tile grid, the clouds of working cores 4 and 2 chain banks at 0-, 1-, and 2-hop distance]

Cloud capacity: 35 (tokens), chained MRU → LRU:

Core ID      4  1  5  7  3  6  8  2
Token count  8  6  6  4  4  3  2  2
Architectural support
[Figure: each tile couples L2, router, directory, L1, and processor; CloudCache adds a cloud table, monitor tags, and hit counters, used for capacity allocation and cache cloud chaining]

Cloud table entry: Home Core ID, Next Core ID, Token #
Cloud table resembles StimulusCache’s NECP
Additional hardware is per-way hit counters and monitor tags
We do not need core ID tracking information per block
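Taken literally from the slide, a cloud table entry and the per-way counters could be declared as below; field widths are assumptions for a 64-tile CMP:

```c
#include <stdint.h>

/* Cloud table entry per tile: names the chain's owner, the next bank
 * in the chain, and this bank's token (way) count. */
typedef struct {
    uint8_t home_core_id;   /* working core that owns this cloud */
    uint8_t next_core_id;   /* next L2 bank in the cloud chain */
    uint8_t token_count;    /* ways this bank contributes to the cloud */
} cloud_table_entry_t;

/* The only other additions are per-way hit counters and monitor tags;
 * e.g., one counter per way per bank (32 ways assumed): */
typedef struct {
    uint32_t way_hits[32];  /* allocated + monitoring ways */
} bank_hit_counters_t;
```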
Capacity allocation example

Seven working cores

[Figure: global allocations on the tiled CMP, e.g., 10.5, 12.625, 9.625, 14.5, 8, 8, and 0.75 ways; repartitioning every ‘T’ cycles shifts them to 13.125, 8.5, 7.875, 7, 11, 11.5, and 6]
CloudCache is fast?
Remote L2 access still involves 3-way communication:
(1) Directory lookup
(2) Request forwarding
(3) Data forwarding

Distance-aware cache cloud formation tackles only (3)!
Limited target broadcast (LTB)
Make the common case “super fast” and the rare case “not so fast”

Private data: limited target broadcast (no wait for directory lookup)
Shared data: directory-based coherence

Private data ≫ shared data (far more frequent)
LTB example

[Timeline figure: w/o broadcasting, a request waits for the dir. request and dir. response before data arrives; w/ broadcasting, the local L2 lookup and broadcast start immediately and a broadcast hit returns the data sooner]
When is LTB used?
Shared data: access ordering is managed by the directory
• LTBP is NOT used

Private data: access ordering is not needed
• Fast access with broadcast first, then notify the directory (see the sketch below)

Race condition?
• When there is an access request for private data from a non-owner core before the directory is updated
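In code form, the request path might look like this sketch; the private/shared test and every helper are assumed stand-ins for the real NoC and directory machinery:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stubs standing in for real NoC/directory machinery. */
static bool is_private(unsigned long b) { return (b & 1) == 0; } /* toy */
static void broadcast_to_cloud(unsigned long b) { printf("LTB %lx\n", b); }
static void notify_directory(unsigned long b)   { printf("note %lx\n", b); }
static void directory_lookup(unsigned long b)   { printf("dir %lx\n", b); }

/* Sketch: broadcast first for private data, directory path for shared. */
void remote_l2_request(unsigned long block) {
    if (is_private(block)) {
        broadcast_to_cloud(block);  /* no wait for directory lookup */
        notify_directory(block);    /* home directory updated afterwards */
    } else {
        directory_lookup(block);    /* access ordering via directory */
    }
}
```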
LTB protocol (LTBP)
Directory states: R (Ready), W (Wait), B (Busy); BL = broadcast lock, BU = broadcast unlock

Input                        Action
1  Request from other cores  Broadcast lock request
2  Ack from the owner        Process non-owner request
3  Nack from the owner
4  Request from the owner

L2 cache (owner) side:

Input                        Action
1  Broadcast lock request    Ack
2  Invalidation
Quality of Service (QoS) support
Parameters: QoS = maximum performance degradation allowed; sampling factor $S$ = 32 in this work; private cache capacity $P$ = 8 (ways) in this work; currently allocated capacity $C$

Base cycle: cycles (time) with the private cache alone (estimated):
$\text{Basecycle} = \text{currentcycle} + S \sum_{i=P+1}^{C} \text{hits}_i \cdot \text{penalty}$

Estimated cycle: cycles with cache capacity $j$:
$\text{Estimatedcycle}(j) = \text{Basecycle} - S \sum_{i=P+1}^{j} \text{hits}_i \cdot \text{penalty}$ (the sum adds cycles when $j < P$)

Allocate the minimum $j$ such that $\text{Estimatedcycle}(j) \le (1 + \text{QoS}) \cdot \text{Basecycle}$
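A direct coding of this selection, under the formulas above; the counter layout, miss penalty term, and inequality direction are assumptions:

```c
/* Sketch of the QoS capacity check: find the smallest allocation j
 * (in ways) whose estimated cycles stay within the allowed degradation
 * over the private-cache baseline. All names are assumed. */
#define P 8      /* private capacity, ways */
#define C 32     /* allocated + monitored capacity, ways */
#define S 32     /* sampling factor */

/* hits[i] = sampled hits in the (i+1)-th most-recently-used way. */
int min_ways_for_qos(long base_cycle, const long hits[C],
                     long miss_penalty, double qos /* e.g., 0.05 */) {
    for (int j = 0; j <= C; j++) {
        long est = base_cycle;
        /* shrinking below private capacity adds misses ... */
        for (int i = j; i < P; i++) est += S * hits[i] * miss_penalty;
        /* ... growing beyond it removes them */
        for (int i = P; i < j; i++) est -= S * hits[i] * miss_penalty;
        if (est <= (long)((1.0 + qos) * base_cycle))
            return j;    /* smallest capacity within the QoS bound */
    }
    return C;
}
```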
DSR vs. ECC vs. CloudCache
Dynamic Spill Receive (DSR) [Qureshi, HPCA ’09]
• Node is either spiller or receiver depending on memory usage
• Spiller nodes randomly “spill” evicted data to a receiver node
Elastic Cooperative Caching [Herrero et al., ISCA ‘10]
• Each node has private area and shared area
• Nodes with high memory demand can spill evicted data to shared area in a randomly selected node
Experimental setup
TPTS [Lee et al., SPE ‘10, Cho et al., ICPP ‘08]
• 64-core CMP with 8×8 2D mesh, 4-cycles/hop
• Core: Intel’s ATOM-like two-issue in-order pipeline
• Directory-based MESI protocol
• Four independent DRAM controllers, four ports/controller
• DRAM with Samsung DDR3-1600 timing
Workloads
• SPEC2006 (10B cycles); high/medium/low groups based on MPKI at varying cache capacities
• PARSEC (simlarge input set); 16 threads/application
Impact of global partitioning

[Charts: MPKI of the High, Medium, and Low workload groups under global capacity partitioning]
L2 cache access latency

[Charts: access count vs. latency (0–300 cycles) distributions for 401.bzip2 under Private, DSR, Shared, ECC, CloudCache w/o LTB, and CloudCache w/ LTB]

Shared: widely spread latency
Private: fast local access + off-chip access
DSR & ECC: fast local access + widely spread latency
CloudCache w/o LTB: fast local access + narrowly spread latency
CloudCache w/ LTB: fast local access + fast remote access
16 threads

[Chart: results for 16-thread workloads]
32 and 64 threads

[Chart: results for 32- and 64-thread workloads]
PARSEC

[Chart: speedup to private (0–1.6) for workload combinations Comb1–Comb5; bars: Shared, DSR, ECC, CLOUD]
PARSEC

[Chart: per-application speedup to private (0–2) within Comb2 (swaption, blacksch., canneal, bodytrack) and Comb5 (swaption, facesim, canneal, ferret); bars: Shared, DSR, ECC, CLOUD]
Effect of QoS enforcement
[Charts: speedup to private across different programs under No QoS, 98% QoS, and 95% QoS; a zoomed view shows per-program speedups staying near the QoS floor (0.93–1.0)]
CloudCache summary
Future processors will carry many more cores and cache resources and future workloads will be heterogeneous
Low-level caches must become more scalable and flexible
We proposed CloudCache w/ three techniques:
• Global “private” capacity allocation to eliminate interference and minimize on-chip misses
• Distance-aware data placement to overcome NUCA latency
• Limited target broadcast to overcome directory lookup latency

Proposed techniques synergistically improve performance

Still, we find global capacity allocation to be the most effective
QoS is naturally supported in CloudCache
Our multicore cache publications

• Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. Best paper nominee.
• Cho, Jin and Lee, “Achieving Predictable Performance with On-Chip Shared L2 Caches for Manycore-Based Real-Time Systems,” RTCSA 2007. Invited paper.
• Hammoud, Cho and Melhem, “ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors,” HiPEAC 2008.
• Hammoud, Cho and Melhem, “Dynamic Cache Clustering for Chip Multiprocessors,” ICS 2008.
• Jin and Cho, “Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches,” ICPP 2008.
• Oh, Lee, Lee, and Cho, “An Analytical Model to Study Optimal Area Breakdown between Cores and Caches in a Chip Multiprocessor,” ISVLSI 2009.
• Jin and Cho, “SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors,” PACT 2009.
• Lee, Cho and Childers, “StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache,” HPCA 2010.
• Hammoud, Cho and Melhem, “Cache Equalizer: A Placement Mechanism for Chip Multiprocessor Distributed Shared Caches,” HiPEAC 2011.
• Lee, Cho and Childers, “CloudCache: Expanding and Shrinking Private Caches,” HPCA 2011.