CloudCache: Expanding and Shrinking Private Caches

Sangyeun Cho
Computer Science Department, University of Pittsburgh
Credits
Parts of the work presented in this talk come from results obtained in collaboration with students and faculty at the University of Pittsburgh:
• Mohammad Hammoud
• Lei Jin
• Hyunjin Lee
• Kiyeon Lee
• Prof. Bruce Childers
• Prof. Rami Melhem
Partial support has come from:
• NSF grant CCF-0702236
• NSF grant CCF-0952273
• A. Richard Newton Graduate Scholarship, ACM DAC 2008
Recent multicore design trends

Modular designs based on relatively simple cores
• Easier to validate (a single core)
• Easier to scale (the same validated design replicated multiple times)
• For these reasons, future “many-core” processors may well look like this
• Examples: Tilera TILE*, Intel 48-core chip [Howard et al., ISSCC ’10]
Design focus shifts from cores to “uncores”
• Network on-chip (NoC): bus, crossbar, ring, 2D/3D mesh, …
• Memory hierarchy: caches, coherence mechanisms, memory controllers, …
• System-on-chip components like network MAC, cipher, …

Note also:
• In this design, aggregate cache capacity increases as we add more tiles
• Some tiles may be active while others are inactive
L2 cache design issues
L2 caches occupy a major portion of the chip area (probably the largest)

Capacity vs. latency
• Off-chip access (still) expensive
• On-chip latency can grow; with switched on-chip networks, we have NUCA (Non-Uniform Cache Architecture)
• Private cache vs. shared cache, or a hybrid scheme?

Quality of service
• Interference (capacity/bandwidth)
• Explicit/implicit capacity partitioning

Manageability, faults, and yield
• Not all caches have to be active
• Hard (irreversible) faults + process variation + soft faults
Private cache vs. shared cache

Private cache:
• Short hit latency (always local)
• High on-chip miss rate
• Long miss resolution time (i.e., are there replicas on chip?)

Shared cache:
• Low on-chip miss rate
• Straightforward data location
• Long average hit latency
• Uncontrolled interference
Historical perspective

Private cache designs came first
• Multiprocessors built from discrete microprocessors

Shared cache clustering [Nayfeh and Olukotun, ISCA ‘94]

Early multicores found the private design more straightforward
• Earlier AMD chips and Sun Microsystems’ UltraSPARC processors

Intel, IBM, and Sun pushed for shared caches (e.g., Xeon products, POWER-4/5/6, Niagara*)
• A variety of workloads perform relatively well

More recently, private caches regain popularity
• We have more transistors
• Interference is easily controlled
• Matches well with on-/off-chip (shared) L3 $$ [Oh et al., ISVLSI ‘09]
Shared $$ issue 1: NUCA

A compromise design
• Large monolithic shared cache: conceptually simple but very slow
• Smaller cache slices w/ disparate latencies [Kim et al., ASPLOS ‘02]
• Interleave data across all available slices (e.g., IBM POWER-*)

Problems
• Blind data distribution with no provisions for latency hiding
• As a result, simple NUCA performance is not scalable

[Figure: “your program” on one tile, “your data” scattered across all cache slices]
Improving program-data distance
Key idea: replicate or migrate
Victim replication [Zhang and Asanović, ISCA ’05]
• Victims from private L1 cache are “replicated” to local L2 bank
• Essentially, local L2 banks incorporate large victim caching space
• A natural hybrid scheme that takes advantages of shared & private $$
Adaptive Selective Replication [Beckmann et al., MICRO ‘06]
• Uncontrolled replication can be detrimental (shared cache degenerates to private cache)
• Based on benefit and cost, control replication level
• Hardware support is quite complex
Adaptive Controlled Migration [Hammoud, Cho, and Melhem, HiPEAC ‘09]
• Migrate (rather than replicate) a block to minimize the average access latency to this block, based on Manhattan distance (see the sketch below)
• Caveat: book-keeping overheads
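To make the distance idea concrete, here is a rough sketch (not the authors’ exact ACM algorithm) that picks a migration target by minimizing the access-weighted Manhattan distance to a block’s requesters; the 8×8 grid, the Accessor record, and all names are assumptions:

```c
#include <stdlib.h>

/* Hypothetical sketch: pick a migration target tile that minimizes the
 * access-weighted Manhattan distance to a block's requesters. */
#define DIM 8

typedef struct { int x, y, accesses; } Accessor;

static int manhattan(int x0, int y0, int x1, int y1) {
    return abs(x0 - x1) + abs(y0 - y1);
}

/* Returns the tile index (y * DIM + x) with the lowest total cost. */
int pick_migration_target(const Accessor *acc, int n) {
    int best_tile = 0;
    long best_cost = -1;
    for (int y = 0; y < DIM; y++) {
        for (int x = 0; x < DIM; x++) {
            long cost = 0;
            for (int i = 0; i < n; i++)
                cost += (long)acc[i].accesses
                      * manhattan(x, y, acc[i].x, acc[i].y);
            if (best_cost < 0 || cost < best_cost) {
                best_cost = cost;
                best_tile = y * DIM + x;
            }
        }
    }
    return best_tile;
}
```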
3-way communication

Once you migrate a cache block away from its home tile, how do you locate it? [Hammoud, Cho, and Melhem, HiPEAC ‘09]

[Figure: a requester (1) queries B’s home tile, which (2) forwards the request to B’s new host, which (3) returns the data: three network traversals]
Cache block location strategies

Always-go-home
• “Go to the home tile (directory) to find the tracking information”
• Lots of 3-way cache-to-cache transfers

Broadcasting
• “Don’t remember anything”
• Expensive

Caching of tracking information at potential requesters
• “Remember (locally) where previously accessed blocks are”
• Quickly locate the target block locally (after first miss)
• Need maintenance (coherence) of distributed information
• We proposed and studied one such technique [Hammoud, Cho, and Melhem, HiPEAC ‘09]
Caching of tracking information

[Figure: the home tile of B keeps primary tracking info alongside its L2 directory; a requester tile for B keeps a replicated copy (tag + pointer) in its tracking information table (TrT)]

Primary tracking information at home tile
• Shows where the block has been migrated to
• Keeps track of who has copied the tracking information (bit vector)

Tracking information table (TrT)
• A copy of the tracking information (only one pointer)
• Updated as necessary (initiated by the directory of the home tile); a possible layout is sketched below
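The two structures above might be laid out as follows; this is an illustrative guess, with field names and widths assumed rather than taken from the paper:

```c
#include <stdint.h>

/* Primary tracking info, kept at the block's home-tile directory. */
typedef struct {
    uint64_t tag;          /* block address tag */
    uint8_t  host_tile;    /* tile the block has migrated to */
    uint64_t copiers;      /* bit vector: tiles caching this tracking info */
} primary_track_t;

/* Tracking information table (TrT) entry at a requester tile:
 * only one pointer, updated when the home directory initiates it. */
typedef struct {
    uint64_t tag;          /* block address tag */
    uint8_t  host_tile;    /* last-known host of the block */
} trt_entry_t;
```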
A flexible data mapping approach
Key idea: What if we distribute memory pages to cache banks instead of memory blocks? [Cho and Jin, MICRO ‘06]
[Figure: fine-grained mapping of memory blocks vs. page-granularity mapping of memory pages to cache banks]
Observation

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 to different cache banks]

Software maps data to different $$
Key: OS page allocation policies → flexible (see the sketch below)
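A minimal sketch of such an OS policy is page coloring, where the physical page number chosen by the allocator determines the cache bank; the 64-bank layout, 4KB pages, and helper names below are assumptions, not the exact policy of [Cho and Jin, MICRO ‘06]:

```c
#include <stdint.h>

#define NUM_BANKS 64          /* assumed 8x8 tiled CMP */
#define PAGE_SHIFT 12         /* 4 KB pages */

/* A page's "color" (its cache bin) follows from its physical address. */
static inline unsigned page_color(uint64_t paddr) {
    return (unsigned)((paddr >> PAGE_SHIFT) % NUM_BANKS);
}

/* Hypothetical allocator hook: to place a page near tile `home`,
 * the OS picks a free physical page whose color equals `home`. */
uint64_t alloc_page_for_tile(unsigned home,
                             const uint64_t *free_pages, int n) {
    for (int i = 0; i < n; i++)
        if (page_color(free_pages[i] << PAGE_SHIFT) == home)
            return free_pages[i];
    return free_pages[0];     /* fall back to any free page */
}
```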
Shared $$ issue 2: interference
Co-scheduled programs compete for cache capacity freely (“capitalism”)
A well-behaving program gets damaged if another program vies for more cache capacity with little reuse (e.g., streaming)

Performance loss due to interference is hard to predict
[Figure: measured on an 8-core Niagara; interference causes up to a 2× impact]
Explicit/implicit partitioning

Strict partitioning based on system directives
• E.g., program A gets 768KB and program B 256KB
• System must allow a user to specify partition sizes [Iyer, ICS ‘04][Guo et al., MICRO ‘07]

Architectural/system mechanisms
• Way partitioning (e.g., 5 vs. 3 ways) [Suh et al., ICS ‘01][Kim et al., PACT ‘04]
• Bin/bank partitioning: maps pages to cache bins (no hardware support needed) [Liu et al., HPCA ‘04][Cho, Jin and Lee, RTCSA ‘07][Lin et al., HPCA ‘08]

[Figure: ways of a set-associative cache split between program A and program B]

Implicit partitioning based on utility (“utilitarian”)
• Way partitioning based on marginal gains [Suh et al., ICS ‘01][Qureshi et al., MICRO ‘06]
• Bank level [Lee, Cho and Childers, HPCA ’10]
• Bank + way level [Lee, Cho and Childers, HPCA ’11]
Explicit partitioning / Implicit partitioning

[Figures: “Explicit partitioning” and “Implicit partitioning,” each plotted as a function of cache capacity]
Private $$ issue: capacity borrowing
The main drawback of private caches is their fixed (small) capacity

Assume an underlying mechanism enforces the correctness of cache access semantics; key idea: borrow (or steal) capacity from others
• How much capacity do I need and opt to borrow?
• Who may be a potential capacity donor?
• Do we allow performance degradation of a capacity donor?
Cooperative Caching (CC) [Chang and Sohi, ISCA ‘06]
• Stealing coordinated by a centralized directory (not scalable)
Dynamic Spill Receive (DSR) [Qureshi, HPCA ‘09]
• Each node is a spiller or a receiver
• On eviction from a spiller, randomly choose a receiver to copy the data into; then, for data access later, use broadcasting
StimulusCache [Lee, Cho, and Childers, HPCA ‘10]
• It’s likely we’ll have “excess cache” in the future
• Intelligently share the capacity provided by excess caches based on utility
Core disabling uneconomical

Many unused (excess) L2 caches exist

Problem exacerbated with many cores

[Charts: probability of having N sound cores/L2 caches for an 8-core chip (N = 1–8) and a 32-core chip (N = 16–32), comparing the L2 cache alone, the processing logic alone, and the whole core (L2 + proc. logic)]
Dynamic sharing policy

Flow-in#N: # of data blocks flowing into EC#N
Hit#N: # of data blocks hit at EC#N

Hit/Flow-in ↑ → more ECs; Hit/Flow-in ↓ → fewer ECs

[Figure: Cores 0–3 with private L2s share excess caches EC#1 and EC#2 on the path to main memory; per-EC Hits and Flow-in counters drive the decision]
Dynamic sharing policy (cont’d)

• Core 0: needs at least 1 EC; no harmful effect on EC#2 → allocate 2 ECs
• Core 1: needs at least 2 ECs → allocate 2 ECs
• Core 2: needs at least 1 EC; harmful effect on EC#2 → allocate 1 EC
• Core 3: no benefit with ECs → allocate 0 ECs

Maximized capacity utilization, minimized capacity interference (see the sketch below)

[Figure: final EC allocation per core: 2, 2, 1, 0]
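Coded up, this rule might look like the sketch below; the threshold test and the interference check are assumptions standing in for StimulusCache’s actual policy:

```c
/* Hedged sketch of StimulusCache-style dynamic sharing: a core gets the
 * ECs it demonstrably needs, and extends further only when harmless.
 * THRESHOLD and the harm test are assumed, not the published policy. */
#define MAX_EC 2
#define THRESHOLD 0.05   /* minimum useful hits per flow-in (assumed) */

typedef struct { long hits, flow_in; } ec_stats_t;

int ecs_to_allocate(const ec_stats_t *ec, const int harms[MAX_EC]) {
    /* No benefit at the first EC level: allocate no excess cache. */
    double r0 = ec[0].flow_in ? (double)ec[0].hits / ec[0].flow_in : 0.0;
    if (r0 < THRESHOLD) return 0;
    int grant = 1;
    for (int level = 1; level < MAX_EC; level++) {
        double r = ec[level].flow_in
                 ? (double)ec[level].hits / ec[level].flow_in : 0.0;
        if (r >= THRESHOLD || !harms[level])
            grant = level + 1;   /* needed, or harmless to extend */
        else
            break;               /* neither needed nor harmless */
    }
    return grant;
}
```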
Summary: priv. vs. shared caching
For a small- to medium-scale multicore processor, shared caching appears good (for its simplicity)
• A variety of workloads run well
• Care must be taken about interference
• Because of NUCA effects, this approach is not scalable!

For larger-scale multicore processors, private cache schemes appear more promising
• Free performance isolation
• Less NUCA effect
• Easier fault isolation
• Capacity stealing a necessity
Summary: hypothetical caching
(256kB L2 cache slice)
Rel
ativ
e pe
rfor
man
ceto
sha
red
cach
ing
(w/o
pro
filin
g)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
gzip vpr gcc mcf
crafty
parse r eo
ngap
vorte
xbzip
2tw
olf
wupwise swim
mgrid mesa art
equake
ammp
g-mea
n
private w/ profilingshared w/ profilingstatic 2D coloring
Tackle both miss rate and locality through judicious data mapping! [Jin and Cho, ICPP ‘08]
Summary: hypothetical caching

[Chart: # of pages mapped per 256kB L2 cache bank across the on-chip X-Y grid for GCC, relative to the program location]
Summary: hypothetical caching

[Chart: alpha values (0 = private caching … 1 = shared caching, in steps of 0.125) per benchmark (ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise) with 256kB L2 cache banks]
In the remainder of this talk
CloudCache
• Private cache based scheme for large-scale multicore processors
• Capacity borrowing at way/bank level & distance awareness
A 30,000-foot snapshot

[Figure: a tiled CMP; each working core (C) owns a numbered “cache cloud” of nearby banks, with a high hit rate & fast access close by and a low hit rate & slow access farther away]

Each cloud is an exclusive private cache for a working core

Clouds are built with nearby cache ways organized in a chain

Clouds are built in a two-step process: dynamic global partitioning and cache cloud formation
Step 1: Dynamic global partitioning
Global capacity allocator (GCA) runs a partitioning algorithm periodically using “hit count” information
• Each working core sends the GCA hit counts from its “allocated capacity” and from additional “monitoring capacity” (of 32 ways)

Our partitioning considers both utility [Qureshi & Patt, MICRO ‘06] and QoS

This step computes how much capacity to give to each core
[Figure: working cores send hit counts over the network to the GCA; a counter buffer feeds the allocation engine, which returns cache allocation info]
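A utility-driven partitioner in the spirit of [Qureshi & Patt, MICRO ‘06] can be sketched as a greedy marginal-gain loop over the per-way hit counts; this simplification omits lookahead and the QoS constraint, and all names are assumed:

```c
/* Greedy utility-partitioning sketch: repeatedly give the next cache way
 * to the core whose hit counters promise the largest marginal gain.
 * hits[c][w] = sampled hits in core c's (w+1)-th most-recent way. */
#define NCORES   64
#define MAX_WAYS 32   /* allocated + monitoring capacity per core */

void partition(long hits[NCORES][MAX_WAYS], int total_ways,
               int alloc[NCORES]) {
    for (int c = 0; c < NCORES; c++) alloc[c] = 0;
    for (int w = 0; w < total_ways; w++) {
        int best = -1; long best_gain = -1;
        for (int c = 0; c < NCORES; c++) {
            if (alloc[c] >= MAX_WAYS) continue;
            long gain = hits[c][alloc[c]]; /* hits the next way would add */
            if (gain > best_gain) { best_gain = gain; best = c; }
        }
        if (best < 0) break;               /* every core fully allocated */
        alloc[best]++;
    }
}
```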
Step 2: Cache cloud formation

1. Local L2 cache first
2. Threads w/ larger capacity demand first
3. Closer L2 banks first
4. Allocate as much capacity as possible

Allocation is done only for sound L2 banks; repeat until every goal is met (see the sketch below)

[Figure: worked example; each thread’s “Goal / Capacity to allocate” (2.75, 1.75, 1.25, 0.75, 0.25, …) is drained to 0 as nearby banks are claimed. Repeat!]
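The four rules read as the following greedy loop; this is a sketch under assumed data structures, with fractional ways modeling partial-way grants:

```c
/* Sketch of cache-cloud formation following rules 1-4 above. */
#define NBANKS 64

typedef struct {
    double goal;          /* capacity still to allocate (ways, fractional) */
    int order[NBANKS];    /* sound-bank IDs sorted by distance; [0] = local */
} thread_t;

void form_clouds(thread_t *t, int nthreads, double bank_free[NBANKS]) {
    for (;;) {                                 /* Repeat! */
        thread_t *pick = 0;                    /* rule 2: largest demand */
        for (int i = 0; i < nthreads; i++)
            if (t[i].goal > 1e-9 && (!pick || t[i].goal > pick->goal))
                pick = &t[i];
        if (!pick) break;                      /* all demands satisfied */
        for (int k = 0; k < NBANKS && pick->goal > 1e-9; k++) {
            int b = pick->order[k];            /* rules 1 & 3: nearest first */
            if (bank_free[b] <= 0) continue;
            double grab = pick->goal < bank_free[b]     /* rule 4: take as */
                        ? pick->goal : bank_free[b];    /* much as possible */
            bank_free[b] -= grab;
            pick->goal   -= grab;
        }
        if (pick->goal > 1e-9) pick->goal = 0; /* no sound capacity left */
    }
}
```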
Cache cloud chaining example

[Figure: on a 3×3 tile grid, the clouds of working cores 4 and 2 chain banks at 0-, 1-, and 2-hop distance]

Cloud capacity: 35 (tokens), chained MRU → LRU:

Core ID      4  1  5  7  3  6  8  2
Token count  8  6  6  4  4  3  2  2
Architectural support
[Figure: each tile couples L2, router, directory, L1, and processor; CloudCache adds a cloud table, monitor tags, and hit counters, used for capacity allocation and cache cloud chaining]

Cloud table entry: Home Core ID, Next Core ID, Token #
Cloud table resembles StimulusCache’s NECP
Additional hardware is per-way hit counters and monitor tags
We do not need core ID tracking information per block
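Taken literally from the slide, a cloud table entry and the per-way counters could be declared as below; field widths are assumptions for a 64-tile CMP:

```c
#include <stdint.h>

/* Cloud table entry per tile: names the chain's owner, the next bank
 * in the chain, and this bank's token (way) count. */
typedef struct {
    uint8_t home_core_id;   /* working core that owns this cloud */
    uint8_t next_core_id;   /* next L2 bank in the cloud chain */
    uint8_t token_count;    /* ways this bank contributes to the cloud */
} cloud_table_entry_t;

/* The only other additions are per-way hit counters and monitor tags;
 * e.g., one counter per way per bank (32 ways assumed): */
typedef struct {
    uint32_t way_hits[32];  /* allocated + monitoring ways */
} bank_hit_counters_t;
```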
Capacity allocation example

Seven working cores

[Figure: global allocations on the tiled CMP, e.g., 10.5, 12.625, 9.625, 14.5, 8, 8, and 0.75 ways; repartitioning every ‘T’ cycles shifts them to 13.125, 8.5, 7.875, 7, 11, 11.5, and 6]
CloudCache is fast?
Remote L2 access still involves 3-way communication:
(1) Directory lookup
(2) Request forwarding
(3) Data forwarding

Distance-aware cache cloud formation tackles only (3)!
Limited target broadcast (LTB)
Make the common case “super fast” and the rare case “not so fast”

Private data: limited target broadcast (no wait for directory lookup)
Shared data: directory-based coherence

Private data ≫ shared data (far more frequent)
LTB example

[Timeline figure: w/o broadcasting, a request waits for the dir. request and dir. response before data arrives; w/ broadcasting, the local L2 lookup and broadcast start immediately and a broadcast hit returns the data sooner]
When is LTB used?
Shared data: access ordering is managed by the directory
• LTBP is NOT used

Private data: access ordering is not needed
• Fast access with broadcast first, then notify the directory (see the sketch below)

Race condition?
• When there is an access request for private data from a non-owner core before the directory is updated
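In code form, the request path might look like this sketch; the private/shared test and every helper are assumed stand-ins for the real NoC and directory machinery:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stubs standing in for real NoC/directory machinery. */
static bool is_private(unsigned long b) { return (b & 1) == 0; } /* toy */
static void broadcast_to_cloud(unsigned long b) { printf("LTB %lx\n", b); }
static void notify_directory(unsigned long b)   { printf("note %lx\n", b); }
static void directory_lookup(unsigned long b)   { printf("dir %lx\n", b); }

/* Sketch: broadcast first for private data, directory path for shared. */
void remote_l2_request(unsigned long block) {
    if (is_private(block)) {
        broadcast_to_cloud(block);  /* no wait for directory lookup */
        notify_directory(block);    /* home directory updated afterwards */
    } else {
        directory_lookup(block);    /* access ordering via directory */
    }
}
```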
LTB protocol (LTBP)
Directory states: R (Ready), W (Wait), B (Busy); BL = broadcast lock, BU = broadcast unlock

Input                        Action
1  Request from other cores  Broadcast lock request
2  Ack from the owner        Process non-owner request
3  Nack from the owner
4  Request from the owner

L2 cache (owner) side:

Input                        Action
1  Broadcast lock request    Ack
2  Invalidation
Quality of Service (QoS) support
Parameters: QoS = maximum performance degradation allowed; sampling factor $S$ = 32 in this work; private cache capacity $P$ = 8 (ways) in this work; currently allocated capacity $C$

Base cycle: cycles (time) with the private cache alone (estimated):
$\text{Basecycle} = \text{currentcycle} + S \sum_{i=P+1}^{C} \text{hits}_i \cdot \text{penalty}$

Estimated cycle: cycles with cache capacity $j$:
$\text{Estimatedcycle}(j) = \text{Basecycle} - S \sum_{i=P+1}^{j} \text{hits}_i \cdot \text{penalty}$ (the sum adds cycles when $j < P$)

Allocate the minimum $j$ such that $\text{Estimatedcycle}(j) \le (1 + \text{QoS}) \cdot \text{Basecycle}$
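A direct coding of this selection, under the formulas above; the counter layout, miss penalty term, and inequality direction are assumptions:

```c
/* Sketch of the QoS capacity check: find the smallest allocation j
 * (in ways) whose estimated cycles stay within the allowed degradation
 * over the private-cache baseline. All names are assumed. */
#define P 8      /* private capacity, ways */
#define C 32     /* allocated + monitored capacity, ways */
#define S 32     /* sampling factor */

/* hits[i] = sampled hits in the (i+1)-th most-recently-used way. */
int min_ways_for_qos(long base_cycle, const long hits[C],
                     long miss_penalty, double qos /* e.g., 0.05 */) {
    for (int j = 0; j <= C; j++) {
        long est = base_cycle;
        /* shrinking below private capacity adds misses ... */
        for (int i = j; i < P; i++) est += S * hits[i] * miss_penalty;
        /* ... growing beyond it removes them */
        for (int i = P; i < j; i++) est -= S * hits[i] * miss_penalty;
        if (est <= (long)((1.0 + qos) * base_cycle))
            return j;    /* smallest capacity within the QoS bound */
    }
    return C;
}
```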
DSR vs. ECC vs. CloudCache
Dynamic Spill Receive (DSR) [Qureshi, HPCA ’09]
• Node is either spiller or receiver depending on memory usage
• Spiller nodes randomly “spill” evicted data to a receiver node
Elastic Cooperative Caching [Herrero et al., ISCA ‘10]
• Each node has private area and shared area
• Nodes with high memory demand can spill evicted data to shared area in a randomly selected node
Experimental setup
TPTS [Lee et al., SPE ‘10, Cho et al., ICPP ‘08]
• 64-core CMP with 8×8 2D mesh, 4-cycles/hop
• Core: Intel’s ATOM-like two-issue in-order pipeline
• Directory-based MESI protocol
• Four independent DRAM controllers, four ports/controller
• DRAM with Samsung DDR3-1600 timing
Workloads
• SPEC2006 (10B cycles); high/medium/low groups based on MPKI at varying cache capacities
• PARSEC (simlarge input set); 16 threads/application
Impact of global partitioning

[Charts: MPKI of the High, Medium, and Low workload groups under global capacity partitioning]
L2 cache access latency

[Charts: access count vs. latency (0–300 cycles) distributions for 401.bzip2 under Private, DSR, Shared, ECC, CloudCache w/o LTB, and CloudCache w/ LTB]

Shared: widely spread latency
Private: fast local access + off-chip access
DSR & ECC: fast local access + widely spread latency
CloudCache w/o LTB: fast local access + narrowly spread latency
CloudCache w/ LTB: fast local access + fast remote access
16 threads

[Chart: results for 16-thread workloads]
32 and 64 threads

[Chart: results for 32- and 64-thread workloads]
PARSEC

[Chart: speedup to private (0–1.6) for workload combinations Comb1–Comb5; bars: Shared, DSR, ECC, CLOUD]
PARSEC

[Chart: per-application speedup to private (0–2) within Comb2 (swaption, blacksch., canneal, bodytrack) and Comb5 (swaption, facesim, canneal, ferret); bars: Shared, DSR, ECC, CLOUD]
Effect of QoS enforcement
[Charts: speedup to private across different programs under No QoS, 98% QoS, and 95% QoS; a zoomed view shows per-program speedups staying near the QoS floor (0.93–1.0)]
CloudCache summary
Future processors will carry many more cores and cache resources and future workloads will be heterogeneous
Low-level caches must become more scalable and flexible
We proposed CloudCache w/ three techniques:
• Global “private” capacity allocation to eliminate interference and minimize on-chip misses
• Distance-aware data placement to overcome NUCA latency
• Limited target broadcast to overcome directory lookup latency

Proposed techniques synergistically improve performance

Still, we find global capacity allocation to be the most effective
QoS is naturally supported in CloudCache
Our multicore cache publications

• Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. Best paper nominee.
• Cho, Jin and Lee, “Achieving Predictable Performance with On-Chip Shared L2 Caches for Manycore-Based Real-Time Systems,” RTCSA 2007. Invited paper.
• Hammoud, Cho and Melhem, “ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors,” HiPEAC 2008.
• Hammoud, Cho and Melhem, “Dynamic Cache Clustering for Chip Multiprocessors,” ICS 2008.
• Jin and Cho, “Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches,” ICPP 2008.
• Oh, Lee, Lee, and Cho, “An Analytical Model to Study Optimal Area Breakdown between Cores and Caches in a Chip Multiprocessor,” ISVLSI 2009.
• Jin and Cho, “SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors,” PACT 2009.
• Lee, Cho and Childers, “StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache,” HPCA 2010.
• Hammoud, Cho and Melhem, “Cache Equalizer: A Placement Mechanism for Chip Multiprocessor Distributed Shared Caches,” HiPEAC 2011.
• Lee, Cho and Childers, “CloudCache: Expanding and Shrinking Private Caches,” HPCA 2011.