Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki (Northwestern, Carnegie Mellon, EPFL)
© Hardavellas2
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) shown at the scale of the swine flu A/H1N1 virus (CDC)]
Process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)
Device scaling continues for at least another 10 years
Moore's Law Is Alive And Well
Good days ended Nov. 2002 [Yelick09]
"New" Moore's Law: 2× cores with every generation
On-chip cache grows commensurately to supply all cores with data
Larger Caches Are Slower Caches
[Figure: two plots vs. year (1990–2010): L2 cache size (KB, log scale, 1 to 100,000) and L2 hit latency (cycles, 0 to 25). Large caches incur slow access]
Increasing access latency forces caches to be distributed
Cache design trends
As caches become bigger, they get slower
Split cache into smaller "slices"
Balance cache slice access with network latency
Modern Caches: Distributed
Split cache into "slices", distribute across die
[Figure: tiled die with a core and an L2 slice per tile]
Data Placement Determines Performance
[Figure: 8×8 tiled CMP; each tile pairs a core with an L2 cache slice]
Goal: place data on chip close to where they are used
Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture)
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors at run time & exploit their characteristics
  - Make the common case fast, the rare case correct
  - Resolve conflicting requirements
Reactive Nonuniform Cache Architecture
• Cache accesses can be classified at run time
  - Each class amenable to different placement
• Per-class block placement
  - Simple, scalable, transparent
  - No need for HW coherence mechanisms at LLC
  - Up to 32% speedup (17% on average); −5% on avg. from an ideal cache organization
• Rotational Interleaving
  - Data replication and fast single-probe lookup
[Hardavellas et al., ISCA 2009] [Hardavellas et al., IEEE Micro Top Picks 2010]
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Cache accesses dominate execution
[Figure: CPI vs. L2 cache size (MB), broken into L2-hit stalls, memory stalls, and total; lower is better; Ideal shown for reference. 4-core CMP, DSS: TPC-H on DB2, 1GB database] [Hardavellas et al., CIDR 2007]
Bottleneck shifts from memory to L2-hit stalls
How much do we lose?
[Figure: normalized throughput vs. L2 cache size (MB) for DSS-const and DSS-real; higher is better. 4-core CMP, DSS: TPC-H on DB2, 1GB database]
We lose half the potential throughput
Outline
• Introduction
• Why do Cache Accesses Matter?
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Terminology: Data Types
[Figure: Private data are read or written by a single core; Shared Read-Only data are read by multiple cores; Shared Read-Write data are read and written by multiple cores]
Distributed shared L2
[Figure: tiled CMP; blocks address-interleaved across all L2 slices: slice = address mod <#slices>]
Unique location for any block (private or shared)
Maximum capacity, but slow access (30+ cycles)
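In a shared organization, a block's home slice is a pure function of its address. A minimal sketch of the interleaving described above (the 64-byte block size is an assumption, not stated on the slide):

```python
def home_slice(phys_addr: int, num_slices: int, block_bits: int = 6) -> int:
    # Drop the block offset, then interleave: slice = block address mod #slices.
    return (phys_addr >> block_bits) % num_slices
```

Any core can locate any block with this one computation, which is why the shared scheme needs no lookup structure, at the cost of the home often being many hops away.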
Distributed private L2
[Figure: tiled CMP; on every access, data are allocated at the local L2 slice]
Private data: allocate at local L2 slice
Fast access to core-private data
Distributed private L2: shared-RO access
[Figure: shared read-only data replicated across L2 slices; on every access, data are allocated at the local L2 slice]
Wastes capacity due to replication
Distributed private L2: shared-RW access
[Figure: shared read-write data kept coherent via indirection through a directory (dir); on every access, data are allocated at the local L2 slice]
Slow for shared read-write
Wastes capacity (dir overhead) and bandwidth
Conventional Multi-Core Caches
[Figure: shared vs. private organizations side by side]
Shared: address-interleave blocks
  + High capacity
  − Slow access
Private: each block cached locally
  + Fast access (local)
  − Low capacity (replicas)
  − Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
Where to Place the Data?
• Close to where they are used!
• Accessed by single core: migrate locally
• Accessed by many cores: replicate (?)
  - If read-only, replication is OK
  - If read-write, coherence is a problem
• Low reuse: evenly distribute across sharers
[Figure: placement decision by number of sharers and read-write vs. read-only: migrate, replicate, or share]
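The decision above can be sketched as a tiny policy function. The sharer-count threshold and function name are illustrative assumptions, not from the slides:

```python
def placement(num_sharers: int, read_write: bool) -> str:
    # Illustrative placement policy: pick migrate / replicate / share
    # from the sharer count and read-write behavior.
    if num_sharers <= 1:
        return "migrate"      # private: keep at the single user's local slice
    if not read_write:
        return "replicate"    # shared read-only: copies are safe, no coherence
    return "share"            # shared read-write: one address-interleaved home
```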
Methodology
Flexus: full-system cycle-accurate timing simulation [Hardavellas et al., SIGMETRICS-PER 2004; Wenisch et al., IEEE Micro 2006]
Model Parameters
• Tiled, LLC = L2
• Server/Scientific workloads: 16 cores, 1MB/core
• Multi-programmed workloads: 8 cores, 3MB/core
• OoO, 2GHz, 96-entry ROB
• Folded 2D-torus: 2-cycle router, 1-cycle link
• 45ns memory
Workloads
• OLTP: TPC-C 3.0, 100 WH (IBM DB2 v8, Oracle 10g)
• DSS: TPC-H Qry 6, 8, 13 (IBM DB2 v8)
• SPECweb99 on Apache 2.0
• Multiprogrammed: Spec2K
• Scientific: em3d
Cache Access Classification Example
[Figure: bubble chart of % read-write blocks vs. number of sharers, for Instructions, Data-Private, and Data-Shared]
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % L2 accesses
• y axis: % blocks in bubble that are read-write
Cache Access Clustering
[Figure: two bubble charts (% R/W blocks in bubble vs. number of sharers) for Server apps and Scientific/MP apps; Instructions, Data-Private, Data-Shared]
Accesses naturally form 3 clusters:
  - migrate locally (private data)
  - share via address-interleaving (read-write data with many sharers)
  - replicate (read-only data and instructions)
Instruction Replication
• Instruction working set too large for one cache slice
[Figure: instruction blocks distributed within each cluster of neighboring slices and replicated across clusters]
Distribute in cluster of neighbors, replicate across
Reactive NUCA in a nutshell
• Classify accesses
  - private data: like private scheme (migrate)
  - shared data: like shared scheme (interleave)
  - instructions: controlled replication (middle ground)
To place cache blocks, we first need to classify them
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Classification Granularity
• Per-block classification
  - High area/power overhead (cut L2 size by half)
  - High latency (indirection through directory)
• Per-page classification (utilize OS page table)
  - Persistent structure
  - Core accesses the page table for every access anyway (TLB)
  - Utilize already-existing SW/HW structures and events
  - Page classification is accurate (<0.5% error)
Classify entire data pages; page table/TLB for bookkeeping
Classification Mechanisms
• Instruction classification: all accesses from L1-I (per-block)
• Data classification: private/shared per-page at TLB miss
[Figure: on the 1st access, core i TLB-misses on Ld A and the OS marks page A "Private to i"; on a later access by another core j, the OS re-marks A as "Shared"]
Bookkeeping through OS page table and TLB
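The OS-side transition just described can be sketched as a small state machine. The names (`PageInfo`, `classify_on_tlb_miss`) are illustrative, not from the paper:

```python
PRIVATE, SHARED = "private", "shared"

class PageInfo:
    """Per-page classification state kept alongside the page table entry."""
    def __init__(self):
        self.cls = None    # unclassified until first touch
        self.owner = None  # owning core while private

def classify_on_tlb_miss(page: PageInfo, core: int) -> str:
    # First access: mark the page private to the requesting core.
    if page.cls is None:
        page.cls, page.owner = PRIVATE, core
    # Access by a different core: permanently re-classify as shared.
    elif page.cls == PRIVATE and page.owner != core:
        page.cls = SHARED
    return page.cls
```

Because the check piggybacks on TLB misses, no extra hardware lookups are needed on the common (hit) path.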
Page Table and TLB Extensions
Page table entry: vpage | ppage | P/S/I (2 bits) | L2 id (log(n) bits)
TLB entry: vpage | ppage | P/S (1 bit)
• Core accesses the page table for every access anyway (TLB)
  - Pass information from the "directory" to the core
• Utilize already-existing SW/HW structures and events
Page granularity allows simple + practical HW
Data Class Bookkeeping and Lookup
Physical Addr.: tag | cache index | offset
• Private data: place in local L2 slice
  - Page table / TLB entry: vpage | ppage | P | L2 id
• Shared data: place in aggregate L2 (address interleave)
  - Page table / TLB entry: vpage | ppage | S
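Because each class has a single known location, the lookup reduces to a branch on the class bit carried by the TLB; no directory indirection is involved. A sketch under assumed names (the "P"/"S" encoding and 64-byte blocks are illustrative):

```python
def data_lookup_slice(phys_addr: int, page_class: str, local_slice: int,
                      num_slices: int, block_bits: int = 6) -> int:
    # Private pages live in the requester's local slice; shared pages have a
    # single address-interleaved home in the aggregate L2.
    if page_class == "P":
        return local_slice
    return (phys_addr >> block_bits) % num_slices
```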
Coherence: No Need for HW Mechanisms at LLC
• Reactive NUCA placement guarantee
  - Each R/W datum in a unique & known location
  - Private data: local slice; shared data: address-interleaved
[Figure: tiled CMP with per-class placement]
Fast access, eliminates HW overhead, SIMPLE
Instructions Lookup: Rotational Interleaving
[Figure: rotational IDs (RIDs) 0–3 tiled across the die, advancing by +1 per column and +log2(k) per row; each slice caches the same blocks on behalf of any cluster. Example lookup: PC 0xfa480]
• Size-4 clusters: local slice + 3 neighbors
• Destination RID = Addr & (n − 1)
  - Fast access (nearest-neighbor, simple lookup)
  - Balance access latency with capacity constraints
  - Equal capacity pressure at overlapped slices
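A sketch of the size-4 rotational-interleaving lookup. The RID layout (RIDs advance by +1 along a row and by +log2(n) between rows) follows the slide; the cluster shape (local slice plus left, right, and one vertical neighbor) and all function names are assumptions for illustration:

```python
import math

def rid(col: int, row: int, n: int = 4) -> int:
    # Rotational ID of the tile at (col, row): +1 per column, +log2(n) per row.
    return (col + row * int(math.log2(n))) % n

def dest_rid(phys_addr: int, n: int = 4, block_bits: int = 6) -> int:
    # The block's interleaving bits name the RID that caches it: Addr & (n - 1).
    return (phys_addr >> block_bits) & (n - 1)

def lookup_tile(col: int, row: int, phys_addr: int, n: int = 4):
    # Probe the single slice in this tile's size-4 cluster whose RID matches.
    want = dest_rid(phys_addr, n)
    for c, r in [(col, row), (col - 1, row), (col + 1, row), (col, row + 1)]:
        if rid(c, r, n) == want:
            return (c, r)
```

With this layout the four candidate tiles always cover all four RIDs, so every instruction block is found with one probe at distance at most one hop.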
Outline
• Introduction
• Access Classification and Block Placement
• Reactive NUCA Mechanisms
• Evaluation
• Conclusion
Evaluation
[Figure: speedup over Private (−20% to 60%) for Shared (S), R-NUCA (R), and Ideal (I) on each workload. Private-averse: OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d; shared-averse: OLTP Oracle, MIX]
Delivers robust performance across workloads
Shared: same for Web, DSS; 17% for OLTP, MIX
Private: 17% for OLTP, Web, DSS; same for MIX
Conclusions
• Data may exhibit arbitrarily complex behaviors ...but few that matter!
• Learn the behaviors that matter at run time
  - Make the common case fast, the rare case correct
• Reactive NUCA: near-optimal cache block placement
  - Simple, scalable, low-overhead, transparent, no coherence
  - Robust performance: matches best alternative, or 17% better; up to 32%
  - Near-optimal placement (−5% avg. from ideal)
For more information:
http://www.eecs.northwestern.edu/~hardav/
Thank You!
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.
• N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.
BACKUP SLIDES
Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately
  - Fewer but faster cores have the same effect
• Increasing datasets: faster than Moore's Law!
• Power/thermal efficiency: caches are "cool", cores are "hot"
  - So, it's easier to fit more cache in a power budget
• Limited bandwidth: large cache == more data on chip
  - Off-chip pins are used less frequently
Backup Slides
ASR
ASR vs. R-NUCA Configurations

                           ASR-1     ASR-2     R-NUCA
Core Type                  In-Order  OoO       OoO
L2 Size (MB)               4         16        16
Memory                     150       500       90
Local L2                   12        20        16
Avg. Shared L2             25        44        22
Memory / Local L2          12.5×     25.0×     5.6×
Avg. Shared L2 / Local L2  2.1×      2.2×      38%
ASR design space search
[Figure: speedup over private (roughly −6% to +6%) for ASR and static allocation points (Alloc 0%, 25%, 50%, 75%, 100%) on OLTP DB2, Apache, DSS Qry8, em3d, OLTP Oracle, MIX]
Backup Slides
Prior Work
Prior Work
• Several proposals for CMP cache management
  - ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
• ...but they suffer from shortcomings
  - complex, high-latency lookup/coherence
  - don't scale
  - lower effective cache capacity
  - optimize only for a subset of accesses
We need: a simple, scalable mechanism for fast access to all data
Shortcomings of prior work
• L2-Private
  - Wastes capacity
  - High latency (3 slice accesses + 3 hops on sharing)
• L2-Shared
  - High latency
• Cooperative Caching
  - Doesn't scale (centralized tag structure)
• CMP-NuRapid
  - High latency (pointer dereference, 3 hops on sharing)
• OS-managed L2
  - Wastes capacity (migrates all blocks)
  - Spill to neighbors useless (all run same code)
Shortcomings of Prior Work
• D-NUCA
  - No practical implementation (lookup?)
• Victim Replication
  - High latency (like L2-Private)
  - Wastes capacity (home always stores block)
• Adaptive Selective Replication (ASR)
  - High latency (like L2-Private)
  - Capacity pressure (replicates at slice granularity)
  - Complex (4 separate HW structures to bias coin)
Backup Slides
Classification and Lookup
Data Classification Timeline
[Figure: core i TLB-misses on Ld A; the OS marks the page private to i (P, i) and A is allocated in core i's local L2 slice. Later, core j (i ≠ j) TLB-misses on Ld A; the OS re-marks the page shared (S), invalidates core i's TLB entry, core i evicts A, and A is re-allocated at its address-interleaved home slice (core k), which replies with A]
Fast & simple lookup for data
Misclassifications at Page Granularity
• A page may service multiple access types
• But, one type always dominates accesses
[Figure (left): % of total L2 accesses per workload (OLTP DB2, OLTP Oracle, Apache, DSS Qry6/8/13, em3d, MIX) from pages serving one class vs. Instructions+Data vs. Private+Shared Data]
[Figure (right): % of total L2 accesses classified correctly vs. private data misclassified as shared, per workload]
Classification at page granularity is accurate
Backup Slides
Placement
Private Data Placement
• Spill to neighbors if working set too large? NO!!!
  - Each core runs similar threads
Store in local L2 slice (like in a private cache)
[Figure: CDF of total L2 accesses vs. private-data footprint (KB), per workload]
Private Data Working Set
• OLTP: small per-core working set (3MB / 16 cores = 200KB/core)
• Web: primary working set <6KB/core; remaining <1.5% of L2 refs
• DSS: policy doesn't matter much (>100MB working set, <13% of L2 refs, very low reuse on private)
[Figure: CDF of total L2 accesses vs. private-data footprint (KB), per workload]
Shared Data Placement
• Read-write + large working set + low reuse
  - Unlikely to be in local slice for reuse
• Also, next sharer is random [WMPI'04]
Address-interleave in aggregate L2 (like a shared cache)
[Figure (left): CDF of total L2 accesses vs. shared-data footprint (KB), per workload]
[Figure (right): shared-data L2 accesses broken down by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), per workload]
Shared Data Working Set
[Figure: CDF of total L2 accesses vs. shared-data footprint (KB), per workload]
Instruction Placement
• Working set too large for one slice
  - Slices store private & shared data too!
  - Sufficient capacity with 4 L2 slices
Share in clusters of neighbors, replicate across
[Figure (left): CDF of total L2 accesses vs. instruction footprint (KB), per workload]
[Figure (right): instruction L2 accesses broken down by reuse (1st, 2nd, 3rd-4th, 5th-8th, 9+ access), per workload]
Instructions Working Set
[Figure: CDF of total L2 accesses vs. instruction footprint (KB), per workload]
Backup Slides
Rotational Interleaving
Instruction Classification and Lookup
[Figure: instructions shared within a cluster of neighboring slices, replicated across clusters]
• Identification: all accesses from L1-I
• But, working set too large to fit in one cache slice
Share within neighbors' cluster, replicate across
Rotational Interleaving
[Figure: 8×4 tile grid with TileIDs 0–31; rotational IDs (RIDs) 0–3 repeat with +1 per column and +log2(k) per row]
RotationalID_dest = Addr & (n − 1)
TileID_dest = TileID_center + D(RotationalID_center, RotationalID_dest)
  - Fast access (nearest-neighbor, simple lookup)
  - Equalize capacity pressure at overlapping slices
Nearest-neighbor size-8 clusters
[Figure: RID layout for size-8 clusters, tiled so that each tile plus its seven nearest neighbors covers all eight RIDs]