Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki (Northwestern, Carnegie Mellon, EPFL)
© Hardavellas2
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) shown at the scale of the swine flu A/H1N1 virus (CDC)]
Process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)
Device scaling continues for at least another 10 years
Moore's Law Is Alive And Well
Good days ended Nov. 2002 [Yelick09]
"New" Moore's Law: 2× cores with every generation
On-chip cache grows commensurately to supply all cores with data
Larger Caches Are Slower Caches
[Figure: two plots vs. year (1990–2010): L2 cache size (KB, log scale, 1 to 100,000) and L2 hit latency (cycles, 0 to 25). Large caches incur slow access]
Increasing access latency forces caches to be distributed
Cache design trends
As caches become bigger, they get slower
Split cache into smaller "slices"
Balance cache slice access with network latency
Modern Caches: Distributed
Split cache into "slices", distribute across die
[Figure: tiled die with a core and an L2 slice per tile]
Data Placement Determines Performance
[Figure: 8×8 tiled CMP; each tile pairs a core with an L2 cache slice]
Goal: place data on chip close to where they are used
Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture)
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors at run time & exploit their characteristics
  - Make the common case fast, the rare case correct
  - Resolve conflicting requirements
Reactive Nonuniform Cache Architecture
• Cache accesses can be classified at run time
  - Each class amenable to different placement
• Per-class block placement
  - Simple, scalable, transparent
  - No need for HW coherence mechanisms at LLC
  - Up to 32% speedup (17% on average); −5% on avg. from an ideal cache organization
• Rotational Interleaving
  - Data replication and fast single-probe lookup
[Hardavellas et al., ISCA 2009] [Hardavellas et al., IEEE Micro Top Picks 2010]
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Cache accesses dominate execution
[Figure: CPI vs. L2 cache size (MB), broken into L2-hit stalls, memory stalls, and total; lower is better; Ideal shown for reference. 4-core CMP, DSS: TPC-H on DB2, 1GB database] [Hardavellas et al., CIDR 2007]
Bottleneck shifts from memory to L2-hit stalls
How much do we lose?
[Figure: normalized throughput vs. L2 cache size (MB) for DSS-const and DSS-real; higher is better. 4-core CMP, DSS: TPC-H on DB2, 1GB database]
We lose half the potential throughput
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Terminology: Data Types
[Figure: Private data are read or written by a single core; Shared Read-Only data are read by multiple cores; Shared Read-Write data are read and written by multiple cores]
Distributed shared L2
[Figure: tiled CMP; blocks address-interleaved across all L2 slices: slice = address mod <#slices>]
Unique location for any block (private or shared)
Maximum capacity, but slow access (30+ cycles)
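In a shared organization, a block's home slice is a pure function of its address. A minimal sketch of the interleaving described above (the 64-byte block size is an assumption, not stated on the slide):

```python
def home_slice(phys_addr: int, num_slices: int, block_bits: int = 6) -> int:
    # Drop the block offset, then interleave: slice = block address mod #slices.
    return (phys_addr >> block_bits) % num_slices
```

Any core can locate any block with this one computation, which is why the shared scheme needs no lookup structure, at the cost of the home often being many hops away.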
Distributed private L2
[Figure: tiled CMP; on every access, data are allocated at the local L2 slice]
Private data: allocate at local L2 slice
Fast access to core-private data
Distributed private L2: shared-RO access
[Figure: shared read-only data replicated across L2 slices; on every access, data are allocated at the local L2 slice]
Wastes capacity due to replication
Distributed private L2: shared-RW access
[Figure: shared read-write data kept coherent via indirection through a directory (dir); on every access, data are allocated at the local L2 slice]
Slow for shared read-write
Wastes capacity (dir overhead) and bandwidth
Conventional Multi-Core Caches
[Figure: shared vs. private organizations side by side]
Shared: address-interleave blocks
  + High capacity
  − Slow access
Private: each block cached locally
  + Fast access (local)
  − Low capacity (replicas)
  − Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
Where to Place the Data?
• Close to where they are used!
• Accessed by single core: migrate locally
• Accessed by many cores: replicate (?)
  - If read-only, replication is OK
  - If read-write, coherence is a problem
• Low reuse: evenly distribute across sharers
[Figure: placement decision by number of sharers and read-write vs. read-only: migrate, replicate, or share]
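The decision above can be sketched as a tiny policy function. The sharer-count threshold and function name are illustrative assumptions, not from the slides:

```python
def placement(num_sharers: int, read_write: bool) -> str:
    # Illustrative placement policy: pick migrate / replicate / share
    # from the sharer count and read-write behavior.
    if num_sharers <= 1:
        return "migrate"      # private: keep at the single user's local slice
    if not read_write:
        return "replicate"    # shared read-only: copies are safe, no coherence
    return "share"            # shared read-write: one address-interleaved home
```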
Methodology
Flexus: full-system cycle-accurate timing simulation [Hardavellas et al., SIGMETRICS-PER 2004; Wenisch et al., IEEE Micro 2006]
Model Parameters
• Tiled, LLC = L2
• Server/Scientific workloads: 16 cores, 1MB/core
• Multi-programmed workloads: 8 cores, 3MB/core
• OoO, 2GHz, 96-entry ROB
• Folded 2D-torus: 2-cycle router, 1-cycle link
• 45ns memory
Workloads
• OLTP: TPC-C 3.0, 100 WH (IBM DB2 v8, Oracle 10g)
• DSS: TPC-H Qry 6, 8, 13 (IBM DB2 v8)
• SPECweb99 on Apache 2.0
• Multiprogrammed: Spec2K
• Scientific: em3d
Cache Access Classification Example
[Figure: bubble chart of % read-write blocks vs. number of sharers, for Instructions, Data-Private, and Data-Shared]
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % L2 accesses
• y axis: % blocks in bubble that are read-write
Cache Access Clustering
[Figure: two bubble charts (% R/W blocks in bubble vs. number of sharers) for Server apps and Scientific/MP apps; Instructions, Data-Private, Data-Shared]
Accesses naturally form 3 clusters:
  - migrate locally (private data)
  - share via address-interleaving (read-write data with many sharers)
  - replicate (read-only data and instructions)
Instruction Replication
• Instruction working set too large for one cache slice
[Figure: instruction blocks distributed within each cluster of neighboring slices and replicated across clusters]
Distribute in cluster of neighbors, replicate across
Reactive NUCA in a nutshell
• Classify accesses
  - private data: like private scheme (migrate)
  - shared data: like shared scheme (interleave)
  - instructions: controlled replication (middle ground)
To place cache blocks, we first need to classify them
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Classification Granularity
• Per-block classification
  - High area/power overhead (cut L2 size by half)
  - High latency (indirection through directory)
• Per-page classification (utilize OS page table)
  - Persistent structure
  - Core accesses the page table for every access anyway (TLB)
  - Utilize already-existing SW/HW structures and events
  - Page classification is accurate (<0.5% error)
Classify entire data pages; page table/TLB for bookkeeping
Classification Mechanisms
• Instruction classification: all accesses from L1-I (per-block)
• Data classification: private/shared per-page at TLB miss
[Figure: on the 1st access, core i TLB-misses on Ld A and the OS marks page A "Private to i"; on a later access by another core j, the OS re-marks A as "Shared"]
Bookkeeping through OS page table and TLB
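The OS-side transition just described can be sketched as a small state machine. The names (`PageInfo`, `classify_on_tlb_miss`) are illustrative, not from the paper:

```python
PRIVATE, SHARED = "private", "shared"

class PageInfo:
    """Per-page classification state kept alongside the page table entry."""
    def __init__(self):
        self.cls = None    # unclassified until first touch
        self.owner = None  # owning core while private

def classify_on_tlb_miss(page: PageInfo, core: int) -> str:
    # First access: mark the page private to the requesting core.
    if page.cls is None:
        page.cls, page.owner = PRIVATE, core
    # Access by a different core: permanently re-classify as shared.
    elif page.cls == PRIVATE and page.owner != core:
        page.cls = SHARED
    return page.cls
```

Because the check piggybacks on TLB misses, no extra hardware lookups are needed on the common (hit) path.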
Page Table and TLB Extensions
Page table entry: vpage | ppage | P/S/I (2 bits) | L2 id (log(n) bits)
TLB entry: vpage | ppage | P/S (1 bit)
• Core accesses the page table for every access anyway (TLB)
  - Pass information from the "directory" to the core
• Utilize already-existing SW/HW structures and events
Page granularity allows simple + practical HW
Data Class Bookkeeping and Lookup
Physical Addr.: tag | cache index | offset
• Private data: place in local L2 slice
  - Page table / TLB entry: vpage | ppage | P | L2 id
• Shared data: place in aggregate L2 (address interleave)
  - Page table / TLB entry: vpage | ppage | S
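Because each class has a single known location, the lookup reduces to a branch on the class bit carried by the TLB; no directory indirection is involved. A sketch under assumed names (the "P"/"S" encoding and 64-byte blocks are illustrative):

```python
def data_lookup_slice(phys_addr: int, page_class: str, local_slice: int,
                      num_slices: int, block_bits: int = 6) -> int:
    # Private pages live in the requester's local slice; shared pages have a
    # single address-interleaved home in the aggregate L2.
    if page_class == "P":
        return local_slice
    return (phys_addr >> block_bits) % num_slices
```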
Coherence: No Need for HW Mechanisms at LLC
• Reactive NUCA placement guarantee
  - Each R/W datum in a unique & known location
  - Private data: local slice; shared data: address-interleaved
[Figure: tiled CMP with per-class placement]
Fast access, eliminates HW overhead, SIMPLE
Instructions Lookup: Rotational Interleaving
[Figure: rotational IDs (RIDs) 0–3 tiled across the die, advancing by +1 per column and +log2(k) per row; each slice caches the same blocks on behalf of any cluster. Example lookup: PC 0xfa480]
• Size-4 clusters: local slice + 3 neighbors
• Destination RID = Addr & (n − 1)
  - Fast access (nearest-neighbor, simple lookup)
  - Balance access latency with capacity constraints
  - Equal capacity pressure at overlapped slices
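A sketch of the size-4 rotational-interleaving lookup. The RID layout (RIDs advance by +1 along a row and by +log2(n) between rows) follows the slide; the cluster shape (local slice plus left, right, and one vertical neighbor) and all function names are assumptions for illustration:

```python
import math

def rid(col: int, row: int, n: int = 4) -> int:
    # Rotational ID of the tile at (col, row): +1 per column, +log2(n) per row.
    return (col + row * int(math.log2(n))) % n

def dest_rid(phys_addr: int, n: int = 4, block_bits: int = 6) -> int:
    # The block's interleaving bits name the RID that caches it: Addr & (n - 1).
    return (phys_addr >> block_bits) & (n - 1)

def lookup_tile(col: int, row: int, phys_addr: int, n: int = 4):
    # Probe the single slice in this tile's size-4 cluster whose RID matches.
    want = dest_rid(phys_addr, n)
    for c, r in [(col, row), (col - 1, row), (col + 1, row), (col, row + 1)]:
        if rid(c, r, n) == want:
            return (c, r)
```

With this layout the four candidate tiles always cover all four RIDs, so every instruction block is found with one probe at distance at most one hop.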
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Evaluation
[Figure: speedup over Private (−20% to 60%) for Shared (S), R-NUCA (R), and Ideal (I) on each workload. Private-averse: OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d; shared-averse: OLTP Oracle, MIX]
Delivers robust performance across workloads
Shared: same for Web, DSS; 17% for OLTP, MIX
Private: 17% for OLTP, Web, DSS; same for MIX
Conclusions
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors that matter at run time
  - Make the common case fast, the rare case correct
• Reactive NUCA: near-optimal cache block placement
  - Simple, scalable, low-overhead, transparent, no coherence
  - Robust performance: matches best alternative, or 17% better; up to 32%
  - Near-optimal placement (−5% avg. from ideal)
For more information:
http://www.eecs.northwestern.edu/~hardav/
Thank You!
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.
BACKUP SLIDES
Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately
  - Fewer but faster cores have the same effect
• Increasing datasets: faster than Moore's Law!
• Power/thermal efficiency: caches are "cool", cores are "hot"
  - So, it's easier to fit more cache in a power budget
• Limited bandwidth: large cache == more data on chip
  - Off-chip pins are used less frequently
Backup Slides
ASR
ASR vs. R-NUCA Configurations

                           ASR-1     ASR-2     R-NUCA
Core Type                  In-Order  OoO       OoO
L2 Size (MB)               4         16        16
Memory                     150       500       90
Local L2                   12        20        16
Avg. Shared L2             25        44        22
Memory / Local L2          12.5×     25.0×     5.6×
Avg. Shared L2 / Local L2  2.1×      2.2×      38%
ASR design space search
[Figure: speedup over private (roughly −6% to +6%) for ASR and static allocation points (Alloc 0%, 25%, 50%, 75%, 100%) on OLTP DB2, Apache, DSS Qry8, em3d, OLTP Oracle, MIX]
Backup Slides
Prior Work
Prior Work
• Several proposals for CMP cache management
  - ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
• ...but they suffer from shortcomings
  - complex, high-latency lookup/coherence
  - don't scale
  - lower effective cache capacity
  - optimize only for a subset of accesses
We need: a simple, scalable mechanism for fast access to all data
Shortcomings of prior work
• L2-Private
  - Wastes capacity
  - High latency (3 slice accesses + 3 hops on sharing)
• L2-Shared
  - High latency
• Cooperative Caching
  - Doesn't scale (centralized tag structure)
• CMP-NuRapid
  - High latency (pointer dereference, 3 hops on sharing)
• OS-managed L2
  - Wastes capacity (migrates all blocks)
  - Spill to neighbors useless (all run same code)
Shortcomings of Prior Work
• D-NUCA
  - No practical implementation (lookup?)
• Victim Replication
  - High latency (like L2-Private)
  - Wastes capacity (home always stores block)
• Adaptive Selective Replication (ASR)
  - High latency (like L2-Private)
  - Capacity pressure (replicates at slice granularity)
  - Complex (4 separate HW structures to bias coin)
Backup Slides
Classification and Lookup
Data Classification Timeline
[Figure: core i TLB-misses on Ld A; the OS marks the page private to i (P, i) and A is allocated in core i's local L2 slice. Later, core j (i ≠ j) TLB-misses on Ld A; the OS re-marks the page shared (S), invalidates core i's TLB entry, core i evicts A, and A is re-allocated at its address-interleaved home slice (core k), which replies with A]
Fast & simple lookup for data
Misclassifications at Page Granularity
• A page may service multiple access types
• But, one type always dominates accesses
[Figure (left): % of total L2 accesses per workload (OLTP DB2, OLTP Oracle, Apache, DSS Qry6/8/13, em3d, MIX) from pages serving one class vs. Instructions+Data vs. Private+Shared Data]
[Figure (right): % of total L2 accesses classified correctly vs. private data misclassified as shared, per workload]
Classification at page granularity is accurate
Backup Slides
Placement
Private Data Placement
• Spill to neighbors if working set too large? NO!!!
  - Each core runs similar threads
Store in local L2 slice (like in a private cache)
[Figure: CDF of total L2 accesses vs. private-data footprint (KB), per workload]
Private Data Working Set
• OLTP: small per-core working set (3MB / 16 cores = 200KB/core)
• Web: primary working set <6KB/core; remaining <1.5% of L2 refs
• DSS: policy doesn't matter much (>100MB working set, <13% of L2 refs, very low reuse on private)
[Figure: CDF of total L2 accesses vs. private-data footprint (KB), per workload]
Shared Data Placement
• Read-write + large working set + low reuse
  - Unlikely to be in local slice for reuse
• Also, next sharer is random [WMPI'04]
Address-interleave in aggregate L2 (like a shared cache)
[Figure (left): CDF of total L2 accesses vs. shared-data footprint (KB), per workload]
[Figure (right): shared-data L2 accesses broken down by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), per workload]
Shared Data Working Set
[Figure: CDF of total L2 accesses vs. shared-data footprint (KB), per workload]
Instruction Placement
• Working set too large for one slice
  - Slices store private & shared data too!
  - Sufficient capacity with 4 L2 slices
Share in clusters of neighbors, replicate across
[Figure (left): CDF of total L2 accesses vs. instruction footprint (KB), per workload]
[Figure (right): instruction L2 accesses broken down by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), per workload]
Instructions Working Set
[Figure: CDF of total L2 accesses vs. instruction footprint (KB), per workload]
Backup Slides
Rotational Interleaving
Instruction Classification and Lookup
[Figure: instructions shared within a cluster of neighboring slices, replicated across clusters]
• Identification: all accesses from L1-I
• But, working set too large to fit in one cache slice
Share within neighbors' cluster, replicate across
Rotational Interleaving
[Figure: 8×4 tile grid with TileIDs 0–31; rotational IDs (RIDs) 0–3 repeat with +1 per column and +log2(k) per row]
RotationalID_dest = Addr & (n − 1)
TileID_dest = TileID_center + D(RotationalID_center, RotationalID_dest)
  - Fast access (nearest-neighbor, simple lookup)
  - Equalize capacity pressure at overlapping slices
Nearest-neighbor size-8 clusters
[Figure: RID layout for size-8 clusters, tiled so that each tile plus its seven nearest neighbors covers all eight RIDs]